
Posted to reviews@spark.apache.org by "stevomitric (via GitHub)" <gi...@apache.org> on 2024/03/26 10:20:17 UTC

[PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

stevomitric opened a new pull request, #45721:
URL: https://github.com/apache/spark/pull/45721

   ### What changes were proposed in this pull request?
   Added normalization of map keys when they are put in `ArrayBasedMapBuilder`. 
   
   
   ### Why are the changes needed?
   To allow GROUP BY statements for Map Types as per [this discussion](https://github.com/apache/spark/pull/45549#discussion_r1537803505).
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   New unit tests in `ArrayBasedMapBuilderSuite`.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on PR #45721:
URL: https://github.com/apache/spark/pull/45721#issuecomment-2020096455

   Please add info to the description on why the change is needed, i.e. right now we can create a map that contains both `-0` and `0` as separate keys.
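
The `-0` versus `0` case can be reproduced outside Spark. The following is a minimal sketch, assuming the key index behaves like a plain `java.util.HashMap` with boxed keys; `distinctKeys` and `normalizeZero` are hypothetical helpers written for illustration, not part of the patch.

```scala
import java.util.{HashMap => JHashMap}

object NegativeZeroKeys {
  // Hypothetical helper: counts how many distinct keys a java.util.HashMap
  // sees. Boxed java.lang.Double keys are compared by bit pattern, so -0.0
  // and 0.0 land in different entries.
  def distinctKeys(keys: Seq[Double]): Int = {
    val index = new JHashMap[Any, Int]()
    keys.zipWithIndex.foreach { case (k, i) => index.put(k, i) }
    index.size()
  }

  // Stand-in normalizer (an assumption, not Spark's code): -0.0 == 0.0 is
  // true for primitives, so both zeros are mapped to positive zero.
  def normalizeZero(d: Double): Double = if (d == 0.0d) 0.0d else d
}
```

With these helpers, `distinctKeys(Seq(-0.0, 0.0))` yields 2, while mapping `normalizeZero` over the keys first yields 1.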



Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1539209968


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,36 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)
+
+  def normalize(value: Any, dataType: DataType): Any = dataType match {

Review Comment:
   we should return a lambda function to do normalization based on the data type, instead of matching the data type per row.
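
A rough sketch of that idea, using stand-in types rather than Spark's classes: the type match runs once and returns a function, and each row only pays for a function call. `KeyType`, `floatNormalizer`, `doubleNormalizer`, and `normalizerFor` are hypothetical names, and the normalizer bodies are assumptions about what floating-point normalization does.

```scala
object KeyNormalizerSketch {
  // Hypothetical stand-ins for the key data types.
  sealed trait KeyType
  case object FloatKey extends KeyType
  case object DoubleKey extends KeyType
  case object OtherKey extends KeyType

  // Assumed normalizers: one canonical NaN, and -0.0 folded into 0.0.
  val floatNormalizer: Any => Any = v => {
    val f = v.asInstanceOf[Float]
    if (f.isNaN) Float.NaN else if (f == 0.0f) 0.0f else f
  }
  val doubleNormalizer: Any => Any = v => {
    val d = v.asInstanceOf[Double]
    if (d.isNaN) Double.NaN else if (d == 0.0d) 0.0d else d
  }

  // The match happens here, once per key type, not once per row.
  def normalizerFor(keyType: KeyType): Any => Any = keyType match {
    case FloatKey => floatNormalizer
    case DoubleKey => doubleNormalizer
    case OtherKey => (v: Any) => v
  }
}
```

A caller would resolve `val normalize = normalizerFor(keyType)` up front and then apply `normalize(key)` to every key.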




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #45721:
URL: https://github.com/apache/spark/pull/45721#issuecomment-2022475248

   thanks, merging to master!



Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.

stevomitric commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1539019584


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,34 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  def normalize(value: Any, dataType: DataType): Any = dataType match {
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
+    case ArrayType(dt, _) =>
+      new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>

Review Comment:
   Applied `NormalizeFloatingNumbers.needNormalize` here




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1538972121


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilderSuite.scala:
##########
@@ -60,6 +60,40 @@ class ArrayBasedMapBuilderSuite extends SparkFunSuite with SQLHelper {
     )
   }
 
+  test("apply key normalization when creating") {
+    val builderDouble = new ArrayBasedMapBuilder(DoubleType, IntegerType)
+    builderDouble.put(-0.0, 1)
+    checkError(
+      exception = intercept[SparkRuntimeException](builderDouble.put(0.0, 2)),
+      errorClass = "DUPLICATED_MAP_KEY",
+      parameters = Map(
+        "key" -> "0.0",
+        "mapKeyDedupPolicy" -> "\"spark.sql.mapKeyDedupPolicy\"")
+    )
+
+    val builderArray = new ArrayBasedMapBuilder(ArrayType(DoubleType), IntegerType)
+    builderArray.put(new GenericArrayData(Seq(-0.0)), 1)
+    checkError(
+      exception = intercept[SparkRuntimeException](
+        builderArray.put(new GenericArrayData(Seq(0.0)), 1)),
+      errorClass = "DUPLICATED_MAP_KEY",
+      parameters = Map(
+        "key" -> "[0.0]",
+        "mapKeyDedupPolicy" -> "\"spark.sql.mapKeyDedupPolicy\"")
+    )
+
+    val builderStruct = new ArrayBasedMapBuilder(new StructType().add("i", "double"), IntegerType)

Review Comment:
   Maybe add a case where an array is nested inside a struct.




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1538972769


##########
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilderSuite.scala:
##########
@@ -60,6 +60,40 @@ class ArrayBasedMapBuilderSuite extends SparkFunSuite with SQLHelper {
     )
   }
 
+  test("apply key normalization when creating") {

Review Comment:
   add another test for successful normalization




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1539213087


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,36 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)
+
+  def normalize(value: Any, dataType: DataType): Any = dataType match {
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
+    case ArrayType(dt, _) =>

Review Comment:
   no need to handle complex types, as we use `TreeMap` for complex types which should handle floating points well.
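
To illustrate why an ordering-based map can sidestep the problem, here is a sketch that assumes the comparator treats numerically equal doubles as equal. The thread does not show the actual ordering, so `compareDoubles` and `arrayOrdering` are hypothetical, and `Seq[Double]` stands in for an array-typed key.

```scala
import java.util.{Comparator, TreeMap => JTreeMap}

object TreeMapKeys {
  // Assumed comparison: numerically equal doubles tie, so -0.0 and 0.0
  // compare as 0; everything else falls back to Double.compare.
  def compareDoubles(x: Double, y: Double): Int =
    if (x == y) 0 else java.lang.Double.compare(x, y)

  // Lexicographic ordering over arrays of doubles, shorter arrays first on ties.
  val arrayOrdering: Comparator[Seq[Double]] = new Comparator[Seq[Double]] {
    override def compare(a: Seq[Double], b: Seq[Double]): Int = {
      val firstDiff =
        a.zip(b).iterator.map { case (x, y) => compareDoubles(x, y) }.find(_ != 0)
      firstDiff.getOrElse(Integer.compare(a.length, b.length))
    }
  }

  // Counts the distinct keys as seen by a TreeMap that uses the ordering above.
  def distinctKeys(keys: Seq[Seq[Double]]): Int = {
    val index = new JTreeMap[Seq[Double], Int](arrayOrdering)
    keys.zipWithIndex.foreach { case (k, i) => index.put(k, i) }
    index.size()
  }
}
```

Under this assumption, `Seq(-0.0)` and `Seq(0.0)` collapse into a single entry without any explicit normalization step, because a `TreeMap` decides key identity through the comparator rather than through `equals` and `hashCode`.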




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1538971557


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,34 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  def normalize(value: Any, dataType: DataType): Any = dataType match {
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
+    case ArrayType(dt, _) =>
+      new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>

Review Comment:
   If we have an array of 1 million strings, we will go through each value even though we know strings never need normalization.
   
   What about doing the same as in `NormalizeFloatingNumbers` and first checking whether normalization is needed at all?
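
A small sketch of such a guard, with stand-in types instead of Spark's `DataType` hierarchy. `DType`, `needNormalize`, and `normalize` below are hypothetical and only mirror the shape of the suggestion: inspect the type first and return the value untouched when no floating-point field can occur.

```scala
object NeedNormalizeSketch {
  // Hypothetical stand-ins for a few data types.
  sealed trait DType
  case object DoubleT extends DType
  case object StringT extends DType
  final case class ArrayT(element: DType) extends DType

  // True only if a floating-point value can appear somewhere in the type.
  def needNormalize(dt: DType): Boolean = dt match {
    case DoubleT => true
    case ArrayT(element) => needNormalize(element)
    case StringT => false
  }

  // Assumed normalizer: fold -0.0 into 0.0.
  def normalizeDouble(d: Double): Double = if (d == 0.0d) 0.0d else d

  // The guard comes first, so an array of strings is returned as-is
  // (same reference) without visiting any element.
  def normalize(value: Any, dt: DType): Any = dt match {
    case _ if !needNormalize(dt) => value
    case DoubleT => normalizeDouble(value.asInstanceOf[Double])
    case ArrayT(element) => value.asInstanceOf[Seq[Any]].map(normalize(_, element))
    case StringT => value
  }
}
```

In a real builder the result of the check could be computed once per key type rather than once per value.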




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stevomitric (via GitHub)" <gi...@apache.org>.

stevomitric commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1539520949


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,36 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)
+
+  def normalize(value: Any, dataType: DataType): Any = dataType match {
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
+    case ArrayType(dt, _) =>
+      new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>
+        normalize(element, dt)
+      })
+    case StructType(sf) =>
+      new GenericInternalRow(
+        value.asInstanceOf[GenericInternalRow].values.zipWithIndex.map { element =>
+        normalize(element._1, sf(element._2).dataType)

Review Comment:
   As noted by @cloud-fan below, complex types have been dropped.




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1540464966


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +53,27 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  private lazy val keyNeedNormalize =
+    keyType.isInstanceOf[FloatType] || keyType.isInstanceOf[DoubleType]
+
+  def normalize(dataType: DataType): Any => Any = dataType match {

Review Comment:
   ```
   private lazy val keyNormalizer: Any => Any = keyType match {
     case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER
     case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER
     case _ => identity
   }
   ```
   then the caller can just write
   ```
   val keyNormalized = keyNormalizer(key)
   ```




Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #45721: [SPARK-47563][SQL] Add map normalization on creation
URL: https://github.com/apache/spark/pull/45721



Re: [PR] [SPARK-47563][SQL] Add map normalization on creation [spark]

Posted by "stefankandic (via GitHub)" <gi...@apache.org>.

stefankandic commented on code in PR #45721:
URL: https://github.com/apache/spark/pull/45721#discussion_r1539067436


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ArrayBasedMapBuilder.scala:
##########
@@ -52,18 +54,36 @@ class ArrayBasedMapBuilder(keyType: DataType, valueType: DataType) extends Seria
 
   private val mapKeyDedupPolicy = SQLConf.get.getConf(SQLConf.MAP_KEY_DEDUP_POLICY)
 
+  private lazy val keyNeedNormalize = NormalizeFloatingNumbers.needNormalize(keyType)
+
+  def normalize(value: Any, dataType: DataType): Any = dataType match {
+    case FloatType => NormalizeFloatingNumbers.FLOAT_NORMALIZER(value)
+    case DoubleType => NormalizeFloatingNumbers.DOUBLE_NORMALIZER(value)
+    case ArrayType(dt, _) =>
+      new GenericArrayData(value.asInstanceOf[GenericArrayData].array.map { element =>
+        normalize(element, dt)
+      })
+    case StructType(sf) =>
+      new GenericInternalRow(
+        value.asInstanceOf[GenericInternalRow].values.zipWithIndex.map { element =>
+        normalize(element._1, sf(element._2).dataType)

Review Comment:
   you could also check if you need to do normalization here right?
   
   this way we would avoid normalization of all fields of a struct if only one actually needs it


