Posted to reviews@spark.apache.org by "panbingkun (via GitHub)" <gi...@apache.org> on 2024/01/10 13:17:00 UTC

[PR] [SPARK-46654][SQL] df.show() of pyspark displayed different results between Regular Spark and Spark Connect [spark]

panbingkun opened a new pull request, #44665:
URL: https://github.com/apache/spark/pull/44665

   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   If benchmark tests were added, please run the benchmarks in GitHub Actions for the consistent environment, and the instructions could accord to: https://spark.apache.org/developer-tools.html#github-workflow-benchmarks.
   -->
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   <!--
   If generative AI tooling has been used in the process of authoring this patch, please include the
   phrase: 'Generated-by: ' followed by the name of the tool and its version.
   If no, write 'No'.
   Please refer to the [ASF Generative Tooling Guidance](https://www.apache.org/legal/generative-tooling.html) for details.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011158182

   I'd prefer to fix the inconsistency. Can we use `ToPrettyString.eval` to generate pretty strings for these types in `to_csv`?
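For readers unfamiliar with `ToPrettyString`, the following is a rough, non-Spark sketch in plain Python of the kind of rendering it produces for nested values (arrays as `[..]`, maps as `{k -> v}`, structs as `{..}`, nulls as `NULL`). The actual rules live in the Catalyst expression and may differ in detail; the helper name here is hypothetical.

```python
# Illustrative sketch only (not Spark code): approximate the style of
# output ToPrettyString.eval produces for nested values.
def pretty(value):
    if value is None:
        return "NULL"
    if isinstance(value, list):          # stand-in for ArrayType
        return "[" + ", ".join(pretty(v) for v in value) + "]"
    if isinstance(value, dict):          # stand-in for MapType
        return "{" + ", ".join(
            f"{pretty(k)} -> {pretty(v)}" for k, v in value.items()) + "}"
    if isinstance(value, tuple):         # stand-in for a StructType row
        return "{" + ", ".join(pretty(v) for v in value) + "}"
    return str(value)

print(pretty((2, "Alice", [100, 200, 300])))  # {2, Alice, [100, 200, 300]}
```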




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011171313

   I have also tried using `ToPrettyString.eval` to generate pretty strings for these types in `to_csv` in this PR. See:
   https://github.com/apache/spark/pull/44665/commits/22a7afb6951b9274cc428f71e675df233de74a8a
   <img width="520" alt="image" src="https://github.com/apache/spark/assets/15246973/961242a9-d26d-4afe-9b01-ad7ea6ab25d9">
   (PS: some details of using `ToPrettyString.eval` may still need to be confirmed.)
   
   If the final decision is to fix the inconsistency, this approach should also work.
   WDYT @MaxGekk @HyukjinKwon @LuciferYang @xinrong-meng @srowen ?
   
   




Re: [PR] [SPARK-46654][SQL] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "srowen (via GitHub)" <gi...@apache.org>.
srowen commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1499706735


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,43 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> "1",
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType => false

Review Comment:
   Just to keep this tidy, remove the blank lines and use the `|` syntax to handle multiple cases with the same output together





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2009888375

   > You never know how people will use Spark and I'm not sure why it's necessary to ban this use case. 
   
   I believe it is not too bad to ban such output in any case (it is even insecure). @cloud-fan Do you see any use case for that?
   
   Array:
   ```scala
       val valueSchema = StructType(Seq(StructField("age", LongType), StructField("name", StringType), StructField("scores", ArrayType(LongType))))
       val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
       val df = spark.createDataFrame(rows, schema)
       df.select(to_csv($"value")).show(false)
   
   
   +--------------------------------------------------------------------------+
   |to_csv(value)                                                             |
   +--------------------------------------------------------------------------+
   |2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeArrayData@e7bb2e54|
   +--------------------------------------------------------------------------+
   ```
   
   Map:
   ```scala
   +------------------------------------------------------------------------+
   |to_csv(value)                                                           |
   +------------------------------------------------------------------------+
   |2,Alice,org.apache.spark.sql.catalyst.expressions.UnsafeMapData@65ab9790|
   +------------------------------------------------------------------------+
   ```
   
   Struct:
   ```scala
       val rows = new java.util.ArrayList[Row]()
       rows.add(Row(1L, Row(2L, "Alice", Row(100L, 200L, null))))
       val valueSchema = StructType(Seq(StructField("age", LongType), StructField("name", StringType), StructField("scores", StructType(Seq(StructField("id1", LongType), StructField("id2", LongType), StructField("id3", LongType))))))
       val schema = StructType(Seq(StructField("key", LongType), StructField("value", valueSchema)))
       val df = spark.createDataFrame(rows, schema)
   
   +---------------------+
   |to_csv(value)        |
   +---------------------+
   |2,Alice,"[4,64,c8,0]"|
   +---------------------+
   ```
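The `UnsafeArrayData@e7bb2e54` and `UnsafeMapData@65ab9790` fields above come from Java's default `Object.toString`, which prints a class name plus an identity hash rather than the element values. A minimal plain-Python analogue (the class name here is made up for illustration):

```python
# A class that does NOT override __str__/__repr__ behaves like Java's
# default Object.toString: str() yields the class name and an object
# address, not the contents.
class UnsafeArrayDataLike:
    def __init__(self, values):
        self.values = values

arr = UnsafeArrayDataLike([100, 200, 300])
print(str(arr))  # e.g. <__main__.UnsafeArrayDataLike object at 0x7f...>
```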
   
   




Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1500137200


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,43 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> "1",
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType => false

Review Comment:
   Done.





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk closed pull request #44665: [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data
URL: https://github.com/apache/spark/pull/44665




Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519249360


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   I have updated the document `sql-migration-guide.md`.





Re: [PR] [SPARK-46654][SQL] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "xinrong-meng (via GitHub)" <gi...@apache.org>.
xinrong-meng commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1960083872

   Would you please add `[PYTHON]` to the title, considering it includes a user-facing Python change? Thanks!




Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1925637254

   > Can the `from_csv` function parse the output of `to_csv()` back? Could you add tests or modify existing tests, please.
   
   For this case, the `from_csv` function currently can't parse the output of `to_csv()` back.
   But I think we can later improve `from_csv` so that it supports this.
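The round-trip problem mentioned above can be shown without Spark at all, using only Python's stdlib `csv` module: once a nested array is flattened into a CSV field, reading it back yields a plain string, and the element types cannot be recovered because the CSV format itself defines no nested structure.

```python
# Write one row where the third field is an array rendered as a string;
# the commas force the csv writer to quote it.
import csv
import io

buf = io.StringIO()
csv.writer(buf).writerow([2, "Alice", "[100,200,300]"])

# Reading it back: every field comes back as a bare string, so the
# original list-of-ints structure is lost.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)  # ['2', 'Alice', '[100,200,300]']
```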




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1520791983


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   In this Python case, the data type is `ArrayData` (concretely implemented as `GenericArrayData`), so it goes into the following logic:
   
   https://github.com/apache/spark/blob/09bdf2a9334f210ece4c23d0f3324f81113330d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala#L91-L93
   
   https://github.com/apache/spark/blob/09bdf2a9334f210ece4c23d0f3324f81113330d0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85
   
   
   And it just so happens that `GenericArrayData` overrides the `toString` method, so the result appears to display correctly; perhaps it's just a coincidence?
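The "coincidence" described above is easy to reproduce in plain Python (class name hypothetical): when the class does override its string method, as `GenericArrayData` overrides `toString`, the same generic `str(value)` code path suddenly prints readable contents, while an `UnsafeArrayData`-style class without an override prints only an object address.

```python
# Overriding __str__ (analogous to GenericArrayData.toString) makes a
# generic str(value) call look like a deliberate, correct rendering.
class GenericArrayDataLike:
    def __init__(self, values):
        self.values = values

    def __str__(self):
        return "[" + ",".join(str(v) for v in self.values) + "]"

print(str(GenericArrayDataLike([100, 200, 300])))  # [100,200,300]
```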





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2014129875

   > If no objections, I'll revert this tomorrow. We can have a new PR to add pretty string for array/struct/map in `to_csv`.
   
   I will submit a new PR to add pretty strings for array/struct/map in `to_csv` today.




Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1464570225


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -89,8 +90,11 @@ class UnivocityGenerator(
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
     case dt: DataType =>
-      (row: InternalRow, ordinal: Int) =>
-        row.get(ordinal, dt).toString
+      (row: InternalRow, ordinal: Int) => {
+        val valueUTF8String = ToPrettyString(Literal(row.get(ordinal, dt), dt),

Review Comment:
   I changed the approach and enabled `UnivocityGenerator` to support complex-type outputs (`ArrayType`, `MapType`, `StructType`).





Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1886156532

   > @panbingkun Let's re-enable the doctest
   > 
   > https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058
   > 
   > and remove the TODO
   > 
   > https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013
   > 
   > also cc @HyukjinKwon
   
   Okay.




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2008666616

   Why do we remove a working feature? What's wrong with `to_csv` generating non-standard but pretty strings for these values?




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2008710532

   > generating non-standard but pretty strings
   
   @cloud-fan It prints "address" of objects, not values. Do you have an example which you worry about?




Re: [PR] [SPARK-46654][SQL] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1951708007

   > I think we should simply disallow this. We could have a legacy conf, but it doesn't much make sense. What does it return for `ArrayType` and `MapType`?
   
   Should we add the legacy conf? It's hard to confirm whether any users have relied on this erroneous behavior, and directly disallowing it may impose migration costs.
   
   
   




Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1960689067

   Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4?
   
   




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1520694548


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   > I think we should simply disallow this. We could have a legacy conf, but it doesn't much make sense. What does it return for ArrayType and MapType?
   
   @HyukjinKwon See what it returns for `ArrayType` in the example. I doubt that we should simply disallow it without at least a legacy config.





Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1994249547

   +1, LGTM. Merging to master.
   Thank you, @panbingkun and @HyukjinKwon @LuciferYang @srowen for review.




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011239475

   It's OK not to support round-tripping; array/struct/map is not part of the CSV standard either. I think the principle is to avoid breaking changes if possible. Since `to_csv` already supports array/struct/map, let's keep it but fix the inconsistency issue.




Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2009368113

   > So we make a breaking change simply because we don't like the object address in the string? can we fix it?
   
   These types cannot be read back through `from_csv` after `to_csv` generates non-standard but pretty strings for them.
   Do we need to add a configuration to bring this behavior back?
   




Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1886085064

   cc @cloud-fan @LuciferYang 




Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1925639982

   > 1. Does CSV source support this? I think it's disallowed
   > 2. If we want to support this it should better be able to read it back. Problem is that CSV spec doesn't have this.
   > 
   > Should probably just disallow this case
   
   I have privately discussed this issue with @LuciferYang, and for this situation we actually have two options:
   - In `to_csv`, when encountering `complex types` such as `ArrayType`, `MapType`, or `StructType`, raise an error indicating that they are not supported.
   **However, `df.show()` can actually display complex types, so erroring here looks weird.**
   
   - Alternatively, we support it. Indeed, the CSV spec doesn't cover this.


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519273344


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,33 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => isSupportedDataType(dt)) => TypeCheckSuccess
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNSUPPORTED_INPUT_TYPE",

Review Comment:
   Using `UNSUPPORTED_INPUT_TYPE` instead of `UNEXPECTED_INPUT_TYPE` here makes the error message clearer to the user, for example:
   
   ```
       val rows = new java.util.ArrayList[Row]()
       rows.add(Row(1L, Row(2L, "Alice",
         Map("math" -> 100L, "english" -> 200L, "science" -> null))))
   
       val valueSchema = StructType(Seq(
         StructField("age", LongType),
         StructField("name", StringType),
         StructField("scores", MapType(StringType, LongType))))
       val schema = StructType(Seq(
         StructField("key", LongType),
         StructField("value", valueSchema)))
   
       val df = spark.createDataFrame(rows, schema)
   
       df.select(to_csv($"value")).collect()
   ```
   
   Before (last review):
   Users may be confused by the prompt: `requires the "STRUCT" type, however "value" has the type "STRUCT...`
   ```
   [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "to_csv(value)" due to data type mismatch: The 1 parameter requires the "STRUCT" type, however "value" has the type "STRUCT<age: BIGINT, name: STRING, scores: MAP<STRING, BIGINT>>". SQLSTATE: 42K09;
   ```
   
   After (right now):
   ```
   [DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE] Cannot resolve "to_csv(value)" due to data type mismatch: The input of `to_csv` can't be "STRUCT<age: BIGINT, name: STRING, scores: MAP<STRING, BIGINT>>" type data. SQLSTATE: 42K09;
   ```



Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2009679810

   You never know how people will use Spark and I'm not sure why it's necessary to ban this use case. Having inconsistency between scala and python is bad. Can we fix this problem by defining a pretty string format for struct/array/map in the csv writer?
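   A rough, Spark-independent sketch of what such a pretty formatter for nested values could look like (the function name and exact bracket conventions are illustrative assumptions, loosely following the `[...]` and `{k -> v}` shapes Spark uses elsewhere):

```python
# Illustrative only, not Spark code: recursively render nested values as
# readable strings -- lists as [a, b], dicts as {k -> v}, None as NULL.
def pretty(value):
    if value is None:
        return "NULL"
    if isinstance(value, list):
        return "[" + ", ".join(pretty(v) for v in value) + "]"
    if isinstance(value, dict):
        return "{" + ", ".join(
            f"{pretty(k)} -> {pretty(v)}" for k, v in value.items()) + "}"
    return str(value)

print(pretty([100, 200, 300]))
print(pretty({"math": 100, "science": None}))
```

   The resulting strings are readable, but (as discussed above) they are not part of the CSV spec, so `from_csv` would need matching parsing rules to read them back.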


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2012496955

   If no objections, I'll revert this tomorrow. We can have a new PR to add pretty string for array/struct/map in `to_csv`.


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2009269901

   So we make a breaking change simply because we don't like the object address in the string? can we fix it?


Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1462660362


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -89,8 +90,11 @@ class UnivocityGenerator(
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
     case dt: DataType =>
-      (row: InternalRow, ordinal: Int) =>
-        row.get(ordinal, dt).toString
+      (row: InternalRow, ordinal: Int) => {
+        val valueUTF8String = ToPrettyString(Literal(row.get(ordinal, dt), dt),

Review Comment:
   Let me think again, maybe we need to rewrite logic similar to `ToPrettyString` in `UnivocityGenerator` based on the characteristics of CSV.



Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1886155258

   @panbingkun Let's re-enable the doctest
   
   https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058
   
   and remove the TODO
   
   https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013
   
   also cc @HyukjinKwon 


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519260055


##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -294,10 +294,19 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
   }
 
   test("to_csv with option (nullValue)") {

Review Comment:
   Because `NullType` cannot be read back through `from_csv`, the new logic throws a `DATATYPE_MISMATCH.UNSUPPORTED_INPUT_TYPE` exception.
   As far as I understand, this UT mainly tests whether the `nullValue` option is effective,
   and the changed UT keeps that same test target.
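   For context, a minimal, Spark-independent sketch of the idea behind the `nullValue` option (the function name is illustrative, not Spark's writer):

```python
# Illustrative only: the nullValue option replaces nulls with a
# configurable marker string when rendering a row as a CSV line.
def to_csv_line(values, null_value=""):
    return ",".join(null_value if v is None else str(v) for v in values)

print(to_csv_line([1, None, "x"], null_value="-"))
```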



Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2008929754

   `to_json` should be fine because it has nested type representation


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2008719720

   @panbingkun I forgot to ask you. Does `to_json` have the same issue?


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2011183252

   I don't oppose this resolution, but if `to_csv` can handle complex data structures, what would our subsequent plans be? Would `from_csv` also need to support complex data structures to make these paired functions more self-consistent? In the long term, do we need to enhance the read and write capabilities for CSV data sources to support complex data structures?


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1961086132

   > Is this considered a bug fix? Do we need to backport it to branch-3.5/3.4
   
   Yea, I think we need to backport it to `branch-3.5/3.4`.


Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1926271034

   I think we should simply disallow this. We could have a legacy conf, but it doesn't much make sense. What does it return for `ArrayType` and `MapType`?


Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1461496739


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -89,8 +90,11 @@ class UnivocityGenerator(
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
     case dt: DataType =>
-      (row: InternalRow, ordinal: Int) =>
-        row.get(ordinal, dt).toString
+      (row: InternalRow, ordinal: Int) => {
+        val valueUTF8String = ToPrettyString(Literal(row.get(ordinal, dt), dt),

Review Comment:
   Does `ToPrettyString` print nested fields in the same way as `UnivocityGenerator`? For instance, does it follow the datetime patterns specified for the CSV datasource?



Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1960618407

   > Would you please add `[PYTHON]` to title considering it has user-facing python change? Thanks!
   
   Sure, done. 😄


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1985106638

   friendly ping @HyukjinKwon,
   When you are not busy, can you please continue to help review this PR?


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519047021


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   +1



Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519252452


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,36 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> ordinalNumber(0),
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType | BinaryType => false

Review Comment:
   Because these types cannot be read back through `from_csv`.
   I have added related tests in `CsvFunctionsSuite`.



Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2009371343

   > `to_json` should be fine because it has nested type representation
   
   Yeah, there is no standard for nested types in `csv` format, but the `json` format has a standard.
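   A tiny Python illustration of that difference: JSON round-trips nested values because the format defines a nested representation, while CSV defines none.

```python
import json

# JSON has a standard representation for nested data, so a nested value
# survives a serialize/deserialize round trip unchanged.
nested = {"age": 2, "name": "Alice", "scores": [100, 200, 300]}
restored = json.loads(json.dumps(nested))
print(restored)
```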


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2010991612

   - Why is the result displayed through `to_csv` inconsistent between Scala and Python for this case?
     On the Python side, this case ultimately uses `GenericArrayData`, which happens to implement `toString`, so `to_csv` displays readable text.
   https://github.com/apache/spark/blob/11247d804cd370aaeb88736a706c587e7f5c83b3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/GenericArrayData.scala#L85
   
     On the Scala side, however, it ultimately uses `UnsafeArrayData`, which unfortunately does not override `toString` (it falls back to the default `Object.toString`), so `to_csv` displays the address of the object.
   
   - During the implementation of this PR, it could display `non-standard but pretty strings`, as follows:
     https://github.com/apache/spark/pull/44665/commits/9695e975f3299556e7c268918ecd51be7a03c157
     <img width="605" alt="image" src="https://github.com/apache/spark/assets/15246973/fd07dc0a-4d61-4663-8631-daff518da278">
     The disadvantage is that such strings cannot currently be read back through `from_csv`.
     If the outcome of the discussion allows it, it should be easy to bring this feature back.
   
   - Another possible compromise is to add a configuration (by default, `to_csv` would not display data of type [Array, Map, Struct ...] as non-standard but pretty strings). If the user enables this configuration, restore the original behavior?
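   A Python analogy for the `toString` difference described above (class names are illustrative, not Spark code): a type that overrides its string conversion prints readable text, while one that inherits the default prints an object address.

```python
# Analogy for GenericArrayData (overrides toString) vs. UnsafeArrayData
# (inherits Object.toString, which prints a class name and address).
class WithToString:
    def __init__(self, items):
        self.items = items
    def __str__(self):
        return "[" + ",".join(map(str, self.items)) + "]"

class WithoutToString:
    def __init__(self, items):
        self.items = items

print(str(WithToString([1, 2, 3])))     # readable text
print(str(WithoutToString([1, 2, 3])))  # default repr with an object address
```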


Re: [PR] [SPARK-46654][SQL][PYTHON] Make `to_csv` explicitly indicate that it does not support some types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-2014269507

   @cloud-fan 
   The initial version has been submitted here: https://github.com/apache/spark/pull/45657
   I am still improving it, and when it is ready, I will ask for your help in reviewing it. 
   Thank you.


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1518964720


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   This is a breaking change in fact. We need to update the SQL migration guide at least.



##########
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala:
##########
@@ -294,10 +294,19 @@ class CsvFunctionsSuite extends QueryTest with SharedSparkSession {
   }
 
   test("to_csv with option (nullValue)") {

Review Comment:
   Could you explain, please, why did you change the test?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,36 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> ordinalNumber(0),
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {
+    case _: VariantType | BinaryType => false

Review Comment:
   Could you clarify this? Why don't you support the types?
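For readers skimming the hunk, the control flow of the new check can be sketched in plain Python. The type classes below are hypothetical stand-ins, not `pyspark.sql.types`, and the exact set of rejected nested types is inferred from the visible lines of the diff:

```python
# Sketch of the shape of StructsToCsv.checkInputDataTypes above.
# These classes are hypothetical stand-ins, NOT pyspark.sql.types.
class DataType: pass
class StringType(DataType): pass
class IntegerType(DataType): pass
class BinaryType(DataType): pass
class VariantType(DataType): pass
class ArrayType(DataType): pass
class MapType(DataType): pass

class StructType(DataType):
    def __init__(self, field_types):
        self.field_types = field_types

def is_supported(dt):
    # Mirrors the visible hunk: variant and binary are rejected; nested
    # complex types (array/map/struct) are likewise treated as unsupported,
    # which is what triggers TO_CSV_COMPLEX_TYPE for nested structs.
    return not isinstance(dt, (VariantType, BinaryType, ArrayType, MapType, StructType))

def check_input_data_types(dt):
    if isinstance(dt, StructType):
        if all(is_supported(f) for f in dt.field_types):
            return "SUCCESS"
        return "DATATYPE_MISMATCH.TO_CSV_COMPLEX_TYPE"
    return "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE"
```

The three branches map one-to-one onto the Scala `match`: success for a struct of supported field types, `TO_CSV_COMPLEX_TYPE` for any other struct, and `UNEXPECTED_INPUT_TYPE` for a non-struct input.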



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/csvExpressions.scala:
##########
@@ -260,16 +263,36 @@ case class StructsToCsv(
       child = child,
       timeZoneId = None)
 
+  override def checkInputDataTypes(): TypeCheckResult = {
+    child.dataType match {
+      case schema: StructType if schema.map(_.dataType).forall(
+        dt => supportDataType(dt)) => TypeCheckSuccess
+      case _: StructType => DataTypeMismatch(errorSubClass = "TO_CSV_COMPLEX_TYPE")
+      case _ => DataTypeMismatch(
+        errorSubClass = "UNEXPECTED_INPUT_TYPE",
+        messageParameters = Map(
+          "paramIndex" -> ordinalNumber(0),
+          "requiredType" -> toSQLType(StructType),
+          "inputSql" -> toSQLExpr(child),
+          "inputType" -> toSQLType(child.dataType)
+        )
+      )
+    }
+  }
+
+  @tailrec
+  private def supportDataType(dataType: DataType): Boolean = dataType match {

Review Comment:
   ```suggestion
     private def isSupportedDataType(dataType: DataType): Boolean = dataType match {
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-46654][SQL][Python] Make `to_csv` explicitly indicate that it does not support complex types of data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1519100897


##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -15534,19 +15532,7 @@ def to_csv(col: "ColumnOrName", options: Optional[Dict[str, str]] = None) -> Col
     |      2,Alice|
     +-------------+
 
-    Example 2: Converting a complex StructType to a CSV string
-
-    >>> from pyspark.sql import Row, functions as sf
-    >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
-    >>> df = spark.createDataFrame(data, ("key", "value"))
-    >>> df.select(sf.to_csv(df.value)).show(truncate=False) # doctest: +SKIP
-    +-----------------------+
-    |to_csv(value)          |
-    +-----------------------+
-    |2,Alice,"[100,200,300]"|

Review Comment:
   Let me update.





Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "MaxGekk (via GitHub)" <gi...@apache.org>.
MaxGekk commented on code in PR #44665:
URL: https://github.com/apache/spark/pull/44665#discussion_r1461496739


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityGenerator.scala:
##########
@@ -89,8 +90,11 @@ class UnivocityGenerator(
     case udt: UserDefinedType[_] => makeConverter(udt.sqlType)
 
     case dt: DataType =>
-      (row: InternalRow, ordinal: Int) =>
-        row.get(ordinal, dt).toString
+      (row: InternalRow, ordinal: Int) => {
+        val valueUTF8String = ToPrettyString(Literal(row.get(ordinal, dt), dt),

Review Comment:
   Does `ToPrettyString` print nested fields in the same way as `UnivocityGenerator`? For instance, does it follow the datetime patterns specified for the CSV datasource?
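   The distinction this question raises can be illustrated in plain Python. These helpers are hypothetical, not Spark APIs, and Python `strftime` patterns stand in for Spark's `DateTimeFormatter` patterns:

   ```python
   from datetime import date

   # A generic pretty-printer renders values with one fixed format...
   def pretty(value):
       return str(value)  # dates always render ISO-style, e.g. 2024-01-10

   # ...while a CSV writer must honor per-datasource options such as dateFormat.
   def csv_convert(value, options):
       fmt = options.get("dateFormat")
       if isinstance(value, date) and fmt:
           return value.strftime(fmt)
       return str(value)

   d = date(2024, 1, 10)
   print(pretty(d))                                   # 2024-01-10
   print(csv_convert(d, {"dateFormat": "%d/%m/%Y"}))  # 10/01/2024
   ```

   If the pretty-printer ignores such options, reusing it inside the CSV generator would change the output for any user who configured a custom datetime pattern.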





Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "panbingkun (via GitHub)" <gi...@apache.org>.
panbingkun commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1886165111

   > > @panbingkun Let's re-enable the doctest
   > > https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15053-L15058
   > > 
   > > and remove the TODO
   > > https://github.com/apache/spark/blob/514ecc6fc183d7222b9dc299af4df328c71966d1/python/pyspark/sql/functions/builtin.py#L15010-L15013
   > > 
   > > also cc @HyukjinKwon
   > 
   > Okay.
   
   Done.




Re: [PR] [SPARK-46654][SQL] Make to_csv can correctly display complex types data [spark]

Posted by "LuciferYang (via GitHub)" <gi...@apache.org>.
LuciferYang commented on PR #44665:
URL: https://github.com/apache/spark/pull/44665#issuecomment-1925595731

   I think the current issue is that the original PySpark can display complex struct types, 
   
   ```
   bin/pyspark
   Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   24/02/04 14:14:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
   24/02/04 14:14:41 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
         /_/
   
   Using Python version 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022 15:24:06)
   Spark context Web UI available at http://localhost:4041
   Spark context available as 'sc' (master = local[*], app id = local-1707027281550).
   SparkSession available as 'spark'.
   >>> from pyspark.sql import Row, functions as sf
   >>> data = [(1, Row(age=2, name='Alice', scores=[100, 200, 300]))]
   >>> df = spark.createDataFrame(data, ("key", "value"))
   >>> df.select(sf.to_csv(df.value)).show(truncate=False)
   +-----------------------+                                                       
   |to_csv(value)          |
   +-----------------------+
   |2,Alice,"[100,200,300]"|
   +-----------------------+
   ```
   
   but the Python Spark Connect client and the Scala API cannot. Ultimately, we should make them consistent. Should we consider removing this capability from PySpark? However, that might be a breaking change.

