You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "zhengruifeng (via GitHub)" <gi...@apache.org> on 2024/02/07 07:03:41 UTC

[PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

zhengruifeng opened a new pull request, #45056:
URL: https://github.com/apache/spark/pull/45056

   ### What changes were proposed in this pull request?
   Make `ProtoUtils.abbreviate` support repeated fields
   
   
   ### Why are the changes needed?
   existing implementation does not work for repeated fields (strings/messages)
   
   we don't have `repeated bytes` in Spark Connect for now, so let it alone
   
   ### Does this PR introduce _any_ user-facing change?
   no
   
   
   ### How was this patch tested?
   added UTs
   
   ### Was this patch authored or co-authored using generative AI tooling?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #45056:
URL: https://github.com/apache/spark/pull/45056#issuecomment-1933492153

   thank you guys, merged to master


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1482317218


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -42,7 +42,17 @@ private[connect] object ProtoUtils {
         val size = string.length
         val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
         if (size > threshold) {
-          builder.setField(field, createString(string.take(threshold), size))
+          builder.setField(field, truncateString(string, threshold))
+        }
+
+      case (field: FieldDescriptor, strings: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.STRING && field.isRepeated
+            && strings != null =>
+        val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
+        strings.iterator().asScala.zipWithIndex.foreach {
+          case (string: String, i) if string != null && string.length > threshold =>

Review Comment:
   I guess so, but not very sure.
   So I added `string != null` to protect from NPE



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1481292685


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -69,23 +79,33 @@ private[connect] object ProtoUtils {
               .concat(createTruncatedByteString(size)))
         }
 
-      // TODO(SPARK-43117): should also support 1, repeated msg; 2, map<xxx, msg>
+      // TODO(SPARK-46988): should support map<xxx, msg>
       case (field: FieldDescriptor, msg: Message)
-          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && msg != null =>
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && !field.isRepeated
+            && msg != null =>
         builder.setField(field, abbreviate(msg, thresholds))
 
+      case (field: FieldDescriptor, msgs: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && field.isRepeated
+            && msgs != null =>

Review Comment:
   nit: Do we need to consider the scenario where `strings.iterator().hasNext` is `false`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1482339139


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -42,7 +42,17 @@ private[connect] object ProtoUtils {
         val size = string.length
         val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
         if (size > threshold) {
-          builder.setField(field, createString(string.take(threshold), size))
+          builder.setField(field, truncateString(string, threshold))
+        }
+
+      case (field: FieldDescriptor, strings: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.STRING && field.isRepeated
+            && strings != null =>
+        val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
+        strings.iterator().asScala.zipWithIndex.foreach {
+          case (string: String, i) if string != null && string.length > threshold =>

Review Comment:
   In fact, for repeated fields, null checks exist whether adding a single element or adding a collection of elements, for example:
   
   ```java
       public Builder addColumnNames(
           java.lang.String value) {
         if (value == null) { throw new NullPointerException(); }
         ensureColumnNamesIsMutable();
         columnNames_.add(value);
         bitField0_ |= 0x00000004;
         onChanged();
         return this;
       }
   // com.google.protobuf.AbstractMessageLite.Builder#addAll(java.lang.Iterable<T>, java.util.List<? super T>)
   protected static <T> void addAll(final Iterable<T> values, final List<? super T> list) {
         checkNotNull(values);
         if (values instanceof LazyStringList) {
           // For StringOrByteStringLists, check the underlying elements to avoid
           // forcing conversions of ByteStrings to Strings.
           // TODO: Could we just prohibit nulls in all protobuf lists and get rid of this? Is
           // if even possible to hit this condition as all protobuf methods check for null first,
           // right?
           List<?> lazyValues = ((LazyStringList) values).getUnderlyingElements();
           LazyStringList lazyList = (LazyStringList) list;
           ....
   ```
   
   also, I have performed the following checks:
   
   - adding a collection of elements with null
    
   ```
   val names = Seq.range(0, 10).map(i => if (i == 3) null else i.toString * 1024)
   val drop = proto.Drop.newBuilder().setInput(sql).addAllColumnNames(names.asJava).build()
   ```
   
   ```
   [info] - truncate repeated strings with nulls *** FAILED *** (3 milliseconds)
   [info]   java.lang.NullPointerException: Element at index 3 is null.
   [info]   at com.google.protobuf.AbstractMessageLite$Builder.addAllCheckingNulls(AbstractMessageLite.java:359)
   [info]   at com.google.protobuf.AbstractMessageLite$Builder.addAll(AbstractMessageLite.java:414)
   [info]   at org.apache.spark.connect.proto.Drop$Builder.addAllColumnNames(Drop.java:1240)
   ```
   
   - add a null element 
   ```
   val drop = proto.Drop.newBuilder().setInput(sql).addColumnNames(null).build()
   ```
   
   ```
   [info] - truncate repeated strings with nulls *** FAILED *** (4 milliseconds)
   [info]   java.lang.NullPointerException:
   [info]   at org.apache.spark.connect.proto.Drop$Builder.addColumnNames(Drop.java:1221)
   ```
   
   As you can see, under normal circumstances, it is impossible to add a null element to the repeated type. So personally, I don't think this null check is necessary, but I don't object to adding it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1481327727


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -42,7 +42,17 @@ private[connect] object ProtoUtils {
         val size = string.length
         val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
         if (size > threshold) {
-          builder.setField(field, createString(string.take(threshold), size))
+          builder.setField(field, truncateString(string, threshold))
+        }
+
+      case (field: FieldDescriptor, strings: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.STRING && field.isRepeated
+            && strings != null =>
+        val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
+        strings.iterator().asScala.zipWithIndex.foreach {
+          case (string: String, i) if string != null && string.length > threshold =>

Review Comment:
   Can `string` really be null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1481292685


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -69,23 +79,33 @@ private[connect] object ProtoUtils {
               .concat(createTruncatedByteString(size)))
         }
 
-      // TODO(SPARK-43117): should also support 1, repeated msg; 2, map<xxx, msg>
+      // TODO(SPARK-46988): should support map<xxx, msg>
       case (field: FieldDescriptor, msg: Message)
-          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && msg != null =>
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && !field.isRepeated
+            && msg != null =>
         builder.setField(field, abbreviate(msg, thresholds))
 
+      case (field: FieldDescriptor, msgs: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && field.isRepeated
+            && msgs != null =>

Review Comment:
   Do we need to consider the scenario where `strings.iterator().hasNext` is `false`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng closed pull request #45056: [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields
URL: https://github.com/apache/spark/pull/45056


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1481292685


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -69,23 +79,33 @@ private[connect] object ProtoUtils {
               .concat(createTruncatedByteString(size)))
         }
 
-      // TODO(SPARK-43117): should also support 1, repeated msg; 2, map<xxx, msg>
+      // TODO(SPARK-46988): should support map<xxx, msg>
       case (field: FieldDescriptor, msg: Message)
-          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && msg != null =>
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && !field.isRepeated
+            && msg != null =>
         builder.setField(field, abbreviate(msg, thresholds))
 
+      case (field: FieldDescriptor, msgs: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.MESSAGE && field.isRepeated
+            && msgs != null =>

Review Comment:
   ~nit: Do we need to consider the scenario where `strings.iterator().hasNext` is `false`?~



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-43117][CONNECT] Make `ProtoUtils.abbreviate` support repeated fields [spark]

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on code in PR #45056:
URL: https://github.com/apache/spark/pull/45056#discussion_r1482368948


##########
connector/connect/common/src/main/scala/org/apache/spark/sql/connect/common/ProtoUtils.scala:
##########
@@ -42,7 +42,17 @@ private[connect] object ProtoUtils {
         val size = string.length
         val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
         if (size > threshold) {
-          builder.setField(field, createString(string.take(threshold), size))
+          builder.setField(field, truncateString(string, threshold))
+        }
+
+      case (field: FieldDescriptor, strings: java.lang.Iterable[_])
+          if field.getJavaType == FieldDescriptor.JavaType.STRING && field.isRepeated
+            && strings != null =>
+        val threshold = thresholds.getOrElse(STRING, MAX_STRING_SIZE)
+        strings.iterator().asScala.zipWithIndex.foreach {
+          case (string: String, i) if string != null && string.length > threshold =>

Review Comment:
   thanks for checking this!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org