Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2024/03/22 03:07:15 UTC

[PR] [WIP][SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map` as pretty strings [spark]

panbingkun opened a new pull request, #45657:
URL: https://github.com/apache/spark/pull/45657

   
   
   
   ### What changes were proposed in this pull request?
   
   
   ### Why are the changes needed?
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   ### How was this patch tested?
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   




Re: [PR] [WIP][SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45657:
URL: https://github.com/apache/spark/pull/45657#issuecomment-2014452948

   1. Why not use `ToPrettyString`?
   Because in `ToPrettyString`, the output logic for `DateType` differs from that of CSV itself, e.g.:
   - `ToPrettyString`
   - `to_csv`
   (the example outputs were attached as screenshots, which are not preserved in this plain-text archive)
   
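   A minimal sketch of the difference (an illustration only, not code from the PR; it assumes `spark.implicits._` is in scope and the shown outputs are assumptions):

   ```scala
   import scala.jdk.CollectionConverters._
   import org.apache.spark.sql.functions.to_csv

   val df = Seq(Tuple1(Tuple1(java.sql.Date.valueOf("2024-03-22")))).toDF("value")
   // to_csv builds its date formatter from the CSV `dateFormat` option...
   df.select(to_csv($"value", Map("dateFormat" -> "dd/MM/yyyy").asJava)).show()
   // expected (assumption): 22/03/2024
   // ...whereas ToPrettyString (as used by df.show) would render 2024-03-22.
   ```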




Re: [PR] [WIP][SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535126352


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,118 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"[100, 200, null, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{math -> 100, english -> 200, science -> null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{100, 200, null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("b", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"\""))

Review Comment:
   Although this output seems a little weird now, once `VariantVal` improves its `toString` logic, it should be able to display human-readable text:
   https://github.com/apache/spark/blob/ca44489f458554714a620827b5fe2978a30db7f9/common/unsafe/src/main/java/org/apache/spark/unsafe/types/VariantVal.java#L101-L109





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535436451


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -64,33 +65,143 @@ class UnivocityGenerator(
   private val nullAsQuotedEmptyString =
     SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)
 
-  @scala.annotation.tailrec
   private def makeConverter(dataType: DataType): ValueConverter = dataType match {
+    case BinaryType =>
+      (getter, ordinal) => SparkStringUtils.getHexString(getter.getBinary(ordinal))
+
     case DateType =>
-      (row: InternalRow, ordinal: Int) => dateFormatter.format(row.getInt(ordinal))
+      (getter, ordinal) => dateFormatter.format(getter.getInt(ordinal))
 
     case TimestampType =>
-      (row: InternalRow, ordinal: Int) => timestampFormatter.format(row.getLong(ordinal))
+      (getter, ordinal) => timestampFormatter.format(getter.getLong(ordinal))
 
     case TimestampNTZType =>
-      (row: InternalRow, ordinal: Int) =>
-        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(row.getLong(ordinal)))
+      (getter, ordinal) =>
+        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(getter.getLong(ordinal)))
 
     case YearMonthIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
+      (getter, ordinal) =>
         IntervalUtils.toYearMonthIntervalString(
-          row.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+          getter.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case DayTimeIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
-      IntervalUtils.toDayTimeIntervalString(
-        row.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+      (getter, ordinal) =>
+        IntervalUtils.toDayTimeIntervalString(
+          getter.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
+    case ArrayType(et, _) =>
+      val elementConverter = makeConverter(et)
+      (getter, ordinal) =>
+        val array = getter.getArray(ordinal)
+        val builder = new StringBuilder
+        builder.append("[")
+        if (array.numElements() > 0) {
+          if (array.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(elementConverter(array, 0))
+          }
+          var i = 1
+          while (i < array.numElements()) {
+            builder.append(", ")
+            if (array.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(elementConverter(array, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("]")
+        builder.toString()
+
+    case MapType(kt, vt, _) =>
+      val keyConverter = makeConverter(kt)
+      val valueConverter = makeConverter(vt)
+      (getter, ordinal) =>
+        val map = getter.getMap(ordinal)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (map.numElements() > 0) {
+          val keyArray = map.keyArray()
+          val valueArray = map.valueArray()
+          builder.append(keyConverter(keyArray, 0))
+          builder.append(" -> ")
+          if (valueArray.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(valueConverter(valueArray, 0))
+          }
+          var i = 1
+          while (i < map.numElements()) {
+            builder.append(", ")
+            builder.append(keyConverter(keyArray, i))
+            builder.append(" -> ")
+            if (valueArray.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(valueConverter(valueArray, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("}")
+        builder.toString()
+
+    case StructType(fields) =>
+      val converters = fields.map(_.dataType).map(makeConverter)
+      (getter, ordinal) =>
+        val row = getter.getStruct(ordinal, fields.length)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (row.numFields > 0) {
+          if (row.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(converters(0)(row, 0))
+          }
+          var i = 1
+          while (i < row.numFields) {
+            builder.append(", ")
+            if (row.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(converters(i)(row, i))

Review Comment:
   Can we add a function for the common null handling code?
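   A minimal sketch of such a helper (an illustration only, assuming the surrounding `nullAsQuotedEmptyString` flag and `options` are in scope; not necessarily the code that was finally merged):

   ```scala
   // Appends either the configured null representation or the converted value.
   // `value` is by-name so the element converter only runs for non-null entries.
   private def appendNullOrValue(
       builder: StringBuilder,
       isNull: Boolean,
       value: => String): Unit = {
     if (isNull) {
       if (nullAsQuotedEmptyString) {
         builder.append(options.nullValue)
       } else {
         builder.append(null.asInstanceOf[String])
       }
     } else {
       builder.append(value)
     }
   }
   ```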





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535287917


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -36,9 +36,9 @@ class UnivocityGenerator(
   writerSettings.setHeaders(schema.fieldNames: _*)
   private val gen = new CsvWriter(writer, writerSettings)
 
-  // A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`.
+  // A `ValueConverter` is responsible for converting a value of an `Any` to `String`.
   // When the value is null, this converter should not be called.
-  private type ValueConverter = (InternalRow, Int) => String

Review Comment:
   Done.





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535478355


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -64,33 +65,143 @@ class UnivocityGenerator(
   private val nullAsQuotedEmptyString =
     SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)
 
-  @scala.annotation.tailrec
   private def makeConverter(dataType: DataType): ValueConverter = dataType match {
+    case BinaryType =>
+      (getter, ordinal) => SparkStringUtils.getHexString(getter.getBinary(ordinal))
+
     case DateType =>
-      (row: InternalRow, ordinal: Int) => dateFormatter.format(row.getInt(ordinal))
+      (getter, ordinal) => dateFormatter.format(getter.getInt(ordinal))
 
     case TimestampType =>
-      (row: InternalRow, ordinal: Int) => timestampFormatter.format(row.getLong(ordinal))
+      (getter, ordinal) => timestampFormatter.format(getter.getLong(ordinal))
 
     case TimestampNTZType =>
-      (row: InternalRow, ordinal: Int) =>
-        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(row.getLong(ordinal)))
+      (getter, ordinal) =>
+        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(getter.getLong(ordinal)))
 
     case YearMonthIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
+      (getter, ordinal) =>
         IntervalUtils.toYearMonthIntervalString(
-          row.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+          getter.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case DayTimeIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
-      IntervalUtils.toDayTimeIntervalString(
-        row.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+      (getter, ordinal) =>
+        IntervalUtils.toDayTimeIntervalString(
+          getter.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
+    case ArrayType(et, _) =>
+      val elementConverter = makeConverter(et)
+      (getter, ordinal) =>
+        val array = getter.getArray(ordinal)
+        val builder = new StringBuilder
+        builder.append("[")
+        if (array.numElements() > 0) {
+          if (array.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(elementConverter(array, 0))
+          }
+          var i = 1
+          while (i < array.numElements()) {
+            builder.append(", ")
+            if (array.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(elementConverter(array, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("]")
+        builder.toString()
+
+    case MapType(kt, vt, _) =>
+      val keyConverter = makeConverter(kt)
+      val valueConverter = makeConverter(vt)
+      (getter, ordinal) =>
+        val map = getter.getMap(ordinal)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (map.numElements() > 0) {
+          val keyArray = map.keyArray()
+          val valueArray = map.valueArray()
+          builder.append(keyConverter(keyArray, 0))
+          builder.append(" -> ")
+          if (valueArray.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(valueConverter(valueArray, 0))
+          }
+          var i = 1
+          while (i < map.numElements()) {
+            builder.append(", ")
+            builder.append(keyConverter(keyArray, i))
+            builder.append(" -> ")
+            if (valueArray.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(valueConverter(valueArray, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("}")
+        builder.toString()
+
+    case StructType(fields) =>
+      val converters = fields.map(_.dataType).map(makeConverter)
+      (getter, ordinal) =>
+        val row = getter.getStruct(ordinal, fields.length)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (row.numFields > 0) {
+          if (row.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(converters(0)(row, 0))
+          }
+          var i = 1
+          while (i < row.numFields) {
+            builder.append(", ")
+            if (row.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(converters(i)(row, i))

Review Comment:
   Okay





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1537159668


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,167 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: null value display when w or w/o options (nullValue)") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", null, "y")))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("x", StringType),
+      StructField("y", StringType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,,y"))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,-,y"))
+  }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"[100, 200,, 300]\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"[100, 200, -, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{math -> 100, english -> 200, science ->}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{math -> 100, english -> 200, science -> -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{100, 200,}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{100, 200, -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("a", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display NullType data") {
+    val df = Seq(Tuple1(Tuple1(null))).toDF("value")
+    val options = Map("nullValue" -> "-")
+    val actual = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual, Row("-"))
+  }
+
+  test("SPARK-47497: from_csv/to_csv does not support VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+
+    checkError(
+      exception = intercept[AnalysisException] {
+        df.select(to_csv($"value")).collect()
+      },
+      errorClass = "DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE",
+      parameters = Map(
+        "functionName" -> "`to_csv`",
+        "dataType" -> "\"STRUCT<age: BIGINT, name: STRING, v: VARIANT>\"",
+        "sqlExpr" -> "\"to_csv(value)\""),
+      context = ExpectedContext(fragment = "to_csv", getCurrentClassCallSitePattern)
+    )
+
+    checkError(
+      exception = intercept[SparkUnsupportedOperationException] {
+        df.select(from_csv(lit("data"), valueSchema, Map.empty[String, String])).collect()
+      },
+      errorClass = "UNSUPPORTED_DATATYPE",
+      parameters = Map("typeName" -> "\"VARIANT\"")
+    )
+  }
+
+  test("SPARK-47497: the input of to_csv must be StructType") {

Review Comment:
   Let me add a UT for it.





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535135208


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,118 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"[100, 200, null, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{math -> 100, english -> 200, science -> null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{100, 200, null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("b", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"\""))

Review Comment:
   Okay





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #45657: [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings
URL: https://github.com/apache/spark/pull/45657




Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535132166


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -36,9 +36,9 @@ class UnivocityGenerator(
   writerSettings.setHeaders(schema.fieldNames: _*)
   private val gen = new CsvWriter(writer, writerSettings)
 
-  // A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`.
+  // A `ValueConverter` is responsible for converting a value of an `Any` to `String`.
   // When the value is null, this converter should not be called.
-  private type ValueConverter = (InternalRow, Int) => String

Review Comment:
   `(InternalRow, Int)` is more efficient than `Any`, as we can avoid boxing for primitive values.





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535384910


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15591,12 +15591,12 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     >>> from pyspark.sql import Row, functions as sf
     >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
     >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|
-    +-----------------------+
+    >>> df.select(sf.to_csv(df.value)).show(truncate=False)
+    +-------------------------+
+    |to_csv(value)            |
+    +-------------------------+
+    |2,Alice,"[100, 200, 300]"|

Review Comment:
   Aligning with `ToPrettyString`, adding a space between array elements seems reasonable, e.g.:
   (screenshot comparing the outputs omitted in this plain-text archive)
   
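   For reference, a rough illustration of the `ToPrettyString` rendering being matched (the exact console output is an assumption):

   ```scala
   // df.show renders arrays via ToPrettyString, with a space after each comma:
   spark.sql("SELECT array(100, 200, 300) AS scores").show(truncate = false)
   // +---------------+
   // |scores         |
   // +---------------+
   // |[100, 200, 300]|
   // +---------------+
   ```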





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535133053


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -36,9 +36,9 @@ class UnivocityGenerator(
   writerSettings.setHeaders(schema.fieldNames: _*)
   private val gen = new CsvWriter(writer, writerSettings)
 
-  // A `ValueConverter` is responsible for converting a value of an `InternalRow` to `String`.
+  // A `ValueConverter` is responsible for converting a value of an `Any` to `String`.
   // When the value is null, this converter should not be called.
-  private type ValueConverter = (InternalRow, Int) => String

Review Comment:
   I get that we want to recursively call the converter for arrays; in that case we can use `(SpecializedGetters, Int)`, as sketched below.
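   A sketch of that signature (for illustration; `SpecializedGetters` is implemented by both `InternalRow` and `ArrayData`, so one converter can be applied recursively to rows, arrays, and map key/value arrays while still reading primitives without boxing):

   ```scala
   import org.apache.spark.sql.catalyst.expressions.SpecializedGetters

   // The same converter type now works for rows and for array/map elements.
   private type ValueConverter = (SpecializedGetters, Int) => String
   ```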





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535547809


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -64,33 +65,143 @@ class UnivocityGenerator(
   private val nullAsQuotedEmptyString =
     SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)
 
-  @scala.annotation.tailrec
   private def makeConverter(dataType: DataType): ValueConverter = dataType match {
+    case BinaryType =>
+      (getter, ordinal) => SparkStringUtils.getHexString(getter.getBinary(ordinal))
+
     case DateType =>
-      (row: InternalRow, ordinal: Int) => dateFormatter.format(row.getInt(ordinal))
+      (getter, ordinal) => dateFormatter.format(getter.getInt(ordinal))
 
     case TimestampType =>
-      (row: InternalRow, ordinal: Int) => timestampFormatter.format(row.getLong(ordinal))
+      (getter, ordinal) => timestampFormatter.format(getter.getLong(ordinal))
 
     case TimestampNTZType =>
-      (row: InternalRow, ordinal: Int) =>
-        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(row.getLong(ordinal)))
+      (getter, ordinal) =>
+        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(getter.getLong(ordinal)))
 
     case YearMonthIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
+      (getter, ordinal) =>
         IntervalUtils.toYearMonthIntervalString(
-          row.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+          getter.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case DayTimeIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
-      IntervalUtils.toDayTimeIntervalString(
-        row.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+      (getter, ordinal) =>
+        IntervalUtils.toDayTimeIntervalString(
+          getter.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
+    case ArrayType(et, _) =>
+      val elementConverter = makeConverter(et)
+      (getter, ordinal) =>
+        val array = getter.getArray(ordinal)
+        val builder = new StringBuilder
+        builder.append("[")
+        if (array.numElements() > 0) {
+          if (array.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(elementConverter(array, 0))
+          }
+          var i = 1
+          while (i < array.numElements()) {
+            builder.append(", ")
+            if (array.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(elementConverter(array, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("]")
+        builder.toString()
+
+    case MapType(kt, vt, _) =>
+      val keyConverter = makeConverter(kt)
+      val valueConverter = makeConverter(vt)
+      (getter, ordinal) =>
+        val map = getter.getMap(ordinal)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (map.numElements() > 0) {
+          val keyArray = map.keyArray()
+          val valueArray = map.valueArray()
+          builder.append(keyConverter(keyArray, 0))
+          builder.append(" -> ")
+          if (valueArray.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(valueConverter(valueArray, 0))
+          }
+          var i = 1
+          while (i < map.numElements()) {
+            builder.append(", ")
+            builder.append(keyConverter(keyArray, i))
+            builder.append(" -> ")
+            if (valueArray.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(valueConverter(valueArray, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("}")
+        builder.toString()
+
+    case StructType(fields) =>
+      val converters = fields.map(_.dataType).map(makeConverter)
+      (getter, ordinal) =>
+        val row = getter.getStruct(ordinal, fields.length)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (row.numFields > 0) {
+          if (row.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(converters(0)(row, 0))
+          }
+          var i = 1
+          while (i < row.numFields) {
+            builder.append(", ")
+            if (row.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(converters(i)(row, i))

Review Comment:
   Please let me check a possible issue before merging.





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #45657:
URL: https://github.com/apache/spark/pull/45657#issuecomment-2014467555

   cc @cloud-fan 




Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535133900


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,118 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"[100, 200, null, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{math -> 100, english -> 200, science -> null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"{100, 200, null}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("b", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,\"\""))

Review Comment:
   I'm fine with banning the variant type in `to_csv`, as it's new.





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1537145637


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,167 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: null value display when w or w/o options (nullValue)") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", null, "y")))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("x", StringType),
+      StructField("y", StringType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,,y"))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,-,y"))
+  }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"[100, 200,, 300]\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"[100, 200, -, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{math -> 100, english -> 200, science ->}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{math -> 100, english -> 200, science -> -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{100, 200,}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{100, 200, -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("a", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display NullType data") {
+    val df = Seq(Tuple1(Tuple1(null))).toDF("value")
+    val options = Map("nullValue" -> "-")
+    val actual = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual, Row("-"))
+  }
+
+  test("SPARK-47497: from_csv/to_csv does not support VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+
+    checkError(
+      exception = intercept[AnalysisException] {
+        df.select(to_csv($"value")).collect()
+      },
+      errorClass = "DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE",
+      parameters = Map(
+        "functionName" -> "`to_csv`",
+        "dataType" -> "\"STRUCT<age: BIGINT, name: STRING, v: VARIANT>\"",
+        "sqlExpr" -> "\"to_csv(value)\""),
+      context = ExpectedContext(fragment = "to_csv", getCurrentClassCallSitePattern)
+    )
+
+    checkError(
+      exception = intercept[SparkUnsupportedOperationException] {
+        df.select(from_csv(lit("data"), valueSchema, Map.empty[String, String])).collect()
+      },
+      errorClass = "UNSUPPORTED_DATATYPE",
+      parameters = Map("typeName" -> "\"VARIANT\"")
+    )
+  }
+
+  test("SPARK-47497: the input of to_csv must be StructType") {

Review Comment:
   What if the input is a nested structure? For example, if an array's elements are maps, does the output remain consistent with what we've seen before?
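
   For illustration, a hypothetical test along those lines (a sketch only; the expected string is an assumption composed from the array and map formats exercised in this suite, not verified output):

   test("nested: array of maps rendered by to_csv") {
     val valueSchema = StructType(Seq(
       StructField("name", StringType),
       StructField("scores", ArrayType(MapType(StringType, LongType)))))
     val schema = StructType(Seq(StructField("value", valueSchema)))

     val rows = new java.util.ArrayList[Row]()
     rows.add(Row(Row("Alice", Seq(Map("math" -> 100L), Map("english" -> 200L)))))

     val df = spark.createDataFrame(rows, schema)
     // If the converters compose, each array element should use the map's
     // `{key -> value}` form inside the array's `[...]` form (assumed, not verified):
     checkAnswer(df.select(to_csv($"value")), Row("Alice,\"[{math -> 100}, {english -> 200}]\""))
   }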





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1537151035


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,167 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: null value display when w or w/o options (nullValue)") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", null, "y")))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("x", StringType),
+      StructField("y", StringType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,,y"))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,-,y"))
+  }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"[100, 200,, 300]\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"[100, 200, -, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{math -> 100, english -> 200, science ->}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{math -> 100, english -> 200, science -> -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{100, 200,}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{100, 200, -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("a", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display NullType data") {
+    val df = Seq(Tuple1(Tuple1(null))).toDF("value")
+    val options = Map("nullValue" -> "-")
+    val actual = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual, Row("-"))
+  }
+
+  test("SPARK-47497: from_csv/to_csv does not support VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+
+    checkError(
+      exception = intercept[AnalysisException] {
+        df.select(to_csv($"value")).collect()
+      },
+      errorClass = "DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE",
+      parameters = Map(
+        "functionName" -> "`to_csv`",
+        "dataType" -> "\"STRUCT<age: BIGINT, name: STRING, v: VARIANT>\"",
+        "sqlExpr" -> "\"to_csv(value)\""),
+      context = ExpectedContext(fragment = "to_csv", getCurrentClassCallSitePattern)
+    )
+
+    checkError(
+      exception = intercept[SparkUnsupportedOperationException] {
+        df.select(from_csv(lit("data"), valueSchema, Map.empty[String, String])).collect()
+      },
+      errorClass = "UNSUPPORTED_DATATYPE",
+      parameters = Map("typeName" -> "\"VARIANT\"")
+    )
+  }
+
+  test("SPARK-47497: the input of to_csv must be StructType") {

Review Comment:
   If it's already supported, should we add some corresponding test cases?
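
   For example, a hypothetical sketch for the inverse nesting (again, the expected string is assumed from the formats above, not verified output):

   test("nested: map of arrays rendered by to_csv") {
     val schema = StructType(Seq(StructField("value", StructType(Seq(
       StructField("scores", MapType(StringType, ArrayType(LongType))))))))

     val rows = new java.util.ArrayList[Row]()
     rows.add(Row(Row(Map("math" -> Seq(100L, 200L)))))

     val df = spark.createDataFrame(rows, schema)
     // Assumed composition: the array form `[...]` nested inside the map form `{k -> v}`.
     checkAnswer(df.select(to_csv($"value")), Row("\"{math -> [100, 200]}\""))
   }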





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1537172409


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -603,4 +605,167 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
       $"csv", schema_of_csv("1,2\n2"), Map.empty[String, String].asJava))
     checkAnswer(actual, Row(Row(1, "2\n2")))
   }
+
+  test("SPARK-47497: null value display when w or w/o options (nullValue)") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", null, "y")))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("x", StringType),
+      StructField("y", StringType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,,y"))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,-,y"))
+  }
+
+  test("SPARK-47497: to_csv support the data of ArrayType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Array(100L, 200L, null, 300L))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", ArrayType(LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"[100, 200,, 300]\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"[100, 200, -, 300]\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of MapType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice",
+      Map("math" -> 100L, "english" -> 200L, "science" -> null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", MapType(StringType, LongType))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{math -> 100, english -> 200, science ->}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{math -> 100, english -> 200, science -> -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of StructType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("scores", StructType(Seq(
+        StructField("id1", LongType),
+        StructField("id2", LongType),
+        StructField("id3", LongType))))))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual1 = df.select(to_csv($"value"))
+    checkAnswer(actual1, Row("2,Alice,\"{100, 200,}\""))
+
+    val options = Map("nullValue" -> "-")
+    val actual2 = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual2, Row("2,Alice,\"{100, 200, -}\""))
+  }
+
+  test("SPARK-47497: to_csv support the data of BinaryType as pretty strings") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", "a".getBytes(StandardCharsets.UTF_8))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("a", BinaryType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+    val actual = df.select(to_csv($"value"))
+    checkAnswer(actual, Row("2,Alice,[61]"))
+  }
+
+  test("SPARK-47497: to_csv can display NullType data") {
+    val df = Seq(Tuple1(Tuple1(null))).toDF("value")
+    val options = Map("nullValue" -> "-")
+    val actual = df.select(to_csv($"value", options.asJava))
+    checkAnswer(actual, Row("-"))
+  }
+
+  test("SPARK-47497: from_csv/to_csv does not support VariantType data") {
+    val rows = new java.util.ArrayList[Row]()
+    rows.add(Row(1L, Row(2L, "Alice", new VariantVal(Array[Byte](1, 2, 3), Array[Byte](4, 5)))))
+
+    val valueSchema = StructType(Seq(
+      StructField("age", LongType),
+      StructField("name", StringType),
+      StructField("v", VariantType)))
+    val schema = StructType(Seq(
+      StructField("key", LongType),
+      StructField("value", valueSchema)))
+
+    val df = spark.createDataFrame(rows, schema)
+
+    checkError(
+      exception = intercept[AnalysisException] {
+        df.select(to_csv($"value")).collect()
+      },
+      errorClass = "DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE",
+      parameters = Map(
+        "functionName" -> "`to_csv`",
+        "dataType" -> "\"STRUCT<age: BIGINT, name: STRING, v: VARIANT>\"",
+        "sqlExpr" -> "\"to_csv(value)\""),
+      context = ExpectedContext(fragment = "to_csv", getCurrentClassCallSitePattern)
+    )
+
+    checkError(
+      exception = intercept[SparkUnsupportedOperationException] {
+        df.select(from_csv(lit("data"), valueSchema, Map.empty[String, String])).collect()
+      },
+      errorClass = "UNSUPPORTED_DATATYPE",
+      parameters = Map("typeName" -> "\"VARIANT\"")
+    )
+  }
+
+  test("SPARK-47497: the input of to_csv must be StructType") {

Review Comment:
   @LuciferYang 
   The related UT is here: https://github.com/apache/spark/pull/45692





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #45657:
URL: https://github.com/apache/spark/pull/45657#discussion_r1535478355


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -64,33 +65,143 @@ class UnivocityGenerator(
   private val nullAsQuotedEmptyString =
     SQLConf.get.getConf(SQLConf.LEGACY_NULL_VALUE_WRITTEN_AS_QUOTED_EMPTY_STRING_CSV)
 
-  @scala.annotation.tailrec
   private def makeConverter(dataType: DataType): ValueConverter = dataType match {
+    case BinaryType =>
+      (getter, ordinal) => SparkStringUtils.getHexString(getter.getBinary(ordinal))
+
     case DateType =>
-      (row: InternalRow, ordinal: Int) => dateFormatter.format(row.getInt(ordinal))
+      (getter, ordinal) => dateFormatter.format(getter.getInt(ordinal))
 
     case TimestampType =>
-      (row: InternalRow, ordinal: Int) => timestampFormatter.format(row.getLong(ordinal))
+      (getter, ordinal) => timestampFormatter.format(getter.getLong(ordinal))
 
     case TimestampNTZType =>
-      (row: InternalRow, ordinal: Int) =>
-        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(row.getLong(ordinal)))
+      (getter, ordinal) =>
+        timestampNTZFormatter.format(DateTimeUtils.microsToLocalDateTime(getter.getLong(ordinal)))
 
     case YearMonthIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
+      (getter, ordinal) =>
         IntervalUtils.toYearMonthIntervalString(
-          row.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+          getter.getInt(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case DayTimeIntervalType(start, end) =>
-      (row: InternalRow, ordinal: Int) =>
-      IntervalUtils.toDayTimeIntervalString(
-        row.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
+      (getter, ordinal) =>
+        IntervalUtils.toDayTimeIntervalString(
+          getter.getLong(ordinal), IntervalStringStyles.ANSI_STYLE, start, end)
 
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
+    case ArrayType(et, _) =>
+      val elementConverter = makeConverter(et)
+      (getter, ordinal) =>
+        val array = getter.getArray(ordinal)
+        val builder = new StringBuilder
+        builder.append("[")
+        if (array.numElements() > 0) {
+          if (array.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(elementConverter(array, 0))
+          }
+          var i = 1
+          while (i < array.numElements()) {
+            builder.append(", ")
+            if (array.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(elementConverter(array, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("]")
+        builder.toString()
+
+    case MapType(kt, vt, _) =>
+      val keyConverter = makeConverter(kt)
+      val valueConverter = makeConverter(vt)
+      (getter, ordinal) =>
+        val map = getter.getMap(ordinal)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (map.numElements() > 0) {
+          val keyArray = map.keyArray()
+          val valueArray = map.valueArray()
+          builder.append(keyConverter(keyArray, 0))
+          builder.append(" -> ")
+          if (valueArray.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(valueConverter(valueArray, 0))
+          }
+          var i = 1
+          while (i < map.numElements()) {
+            builder.append(", ")
+            builder.append(keyConverter(keyArray, i))
+            builder.append(" -> ")
+            if (valueArray.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(valueConverter(valueArray, i))
+            }
+            i += 1
+          }
+        }
+        builder.append("}")
+        builder.toString()
+
+    case StructType(fields) =>
+      val converters = fields.map(_.dataType).map(makeConverter)
+      (getter, ordinal) =>
+        val row = getter.getStruct(ordinal, fields.length)
+        val builder = new StringBuilder
+        builder.append("{")
+        if (row.numFields > 0) {
+          if (row.isNullAt(0)) {
+            if (nullAsQuotedEmptyString) {
+              builder.append(options.nullValue)
+            } else {
+              builder.append(null.asInstanceOf[String])
+            }
+          } else {
+            builder.append(converters(0)(row, 0))
+          }
+          var i = 1
+          while (i < row.numFields) {
+            builder.append(", ")
+            if (row.isNullAt(i)) {
+              if (nullAsQuotedEmptyString) {
+                builder.append(options.nullValue)
+              } else {
+                builder.append(null.asInstanceOf[String])
+              }
+            } else {
+              builder.append(converters(i)(row, i))

Review Comment:
   Okay, I found a bug in the original `to_csv` (before applying this PR), as follows:
   [screenshot: https://github.com/apache/spark/assets/15246973/7491d868-3236-4518-8b27-7aed769d26e4]
   https://github.com/apache/spark/blob/227a50a1766ac1476b0031e1c60d2604eccdb9a7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala#L100-L104
   [screenshot: https://github.com/apache/spark/assets/15246973/e576e7e0-a03b-497f-929a-cf8b1cdcfcc8]

   When `row.isNullAt(i)` and `nullAsQuotedEmptyString == false`, `values(i)` is never assigned, so it keeps its default value `null`, which seems incorrect.

   [screenshot: https://github.com/apache/spark/assets/15246973/68d263e9-934d-436f-b79d-1a04b6cd5b76]
   According to this interpretation, we should assign the value `""` instead.

   @cloud-fan Should we fix it in this PR or open a separate PR to handle it?
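
   For reference, a minimal reproduction sketch of the scenario described above (assumptions: the default `spark.sql.legacy.nullValueWrittenAsQuotedEmptyStringCsv=false` and the pre-PR code path):

   // A struct whose second field is null; built via implicits, e.g. in spark-shell.
   val df = Seq(Tuple1(("a", null.asInstanceOf[String]))).toDF("value")
   // Before this PR, the null field left values(i) unassigned (i.e. null)
   // instead of the empty string "" that the cited docs call for.
   df.select(to_csv($"value")).show(false)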





Re: [PR] [SPARK-47497][SQL] Make `to_csv` support the output of `array/struct/map/binary` as pretty strings [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #45657:
URL: https://github.com/apache/spark/pull/45657#issuecomment-2016718242

   Merged to master for Apache Spark 4.0.0.
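
   For readers landing here later, a quick usage sketch of the merged behavior (expected output composed from the tests quoted above; run e.g. in spark-shell, where the needed implicits are in scope):

   import org.apache.spark.sql.functions.to_csv

   val df = Seq(Tuple1((1L, Seq(100L, 200L), Map("math" -> 100L, "english" -> 200L)))).toDF("value")
   df.select(to_csv($"value")).show(false)
   // Expected, per the formats above (fields containing commas get quoted):
   // 1,"[100, 200]","{math -> 100, english -> 200}"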

