Posted to issues@spark.apache.org by "Wei Guo (Jira)" <ji...@apache.org> on 2023/02/03 03:06:00 UTC
[jira] [Updated] (SPARK-42237) change binary to unsupported dataType in csv format
[ https://issues.apache.org/jira/browse/SPARK-42237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wei Guo updated SPARK-42237:
----------------------------
Description:
When a binary column is written to a CSV file, the actual content of the column is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei/Desktop/binary_csv")
{code}
The csv file's content is as follows:
!image-2023-01-30-17-21-09-212.png|width=141,height=29!
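To illustrate why the written value is meaningless, here is a minimal sketch in plain Scala (no Spark needed). It assumes the CSV writer falls back to the JVM's default toString() for byte arrays, as described above. The Base64 round trip is only one possible reversible text encoding, shown for contrast; it is not what Spark does.
{code:scala}
import java.util.Base64

val bytes = Array[Byte](1, 2)

// JVM arrays inherit Object.toString, which yields the type tag "[B"
// followed by "@" and an identity hash, not the array contents.
val asText = bytes.toString
println(asText) // starts with "[B@"; the hash part differs between runs

// For contrast: Base64 is a reversible text encoding of the same bytes.
val encoded = Base64.getEncoder.encodeToString(bytes)
println(encoded) // AQI=

// Decoding recovers the original bytes, which toString can never do.
val decoded = Base64.getDecoder.decode(encoded)
assert(decoded.sameElements(bytes))
{code}
Because the toString() output carries no information about the bytes themselves, nothing that reads the CSV file can reconstruct the original column.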
Meanwhile, if a binary column is saved as a table with the CSV file format, the table cannot be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to make binary an unsupported data type in the CSV format, for both datasource v1 (CSVFileFormat) and v2 (CSVTable).
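Until such a check exists in Spark itself, a caller could approximate it on the user side. The sketch below is not the proposed patch: binaryColumns and encodeBinaryForCsv are hypothetical helper names, the column names "id" and "payload" are made up for illustration, and the code assumes a spark-shell session (so that spark and its implicits are in scope). It uses the existing base64 function from org.apache.spark.sql.functions.
{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{base64, col}
import org.apache.spark.sql.types.BinaryType

// Hypothetical helper: names of the top-level binary columns in a DataFrame.
def binaryColumns(df: DataFrame): Seq[String] =
  df.schema.fields.filter(_.dataType == BinaryType).map(_.name).toSeq

// Hypothetical helper: replace each binary column with its Base64 string,
// so that the CSV file holds reversible text instead of toString() output.
def encodeBinaryForCsv(df: DataFrame): DataFrame =
  binaryColumns(df).foldLeft(df) { (acc, name) =>
    acc.withColumn(name, base64(col(name)))
  }

val df = Seq((1, Array[Byte](1, 2))).toDF("id", "payload")

// A caller could fail fast here instead, mirroring the proposed behavior.
println(binaryColumns(df)) // the single binary column, "payload"

val safe = encodeBinaryForCsv(df)
safe.printSchema() // "payload" is now a string column
{code}
After this step, safe.write.csv(...) writes ordinary text, and a reader can recover the bytes with unbase64. Only top-level columns are handled; binary fields nested inside structs, arrays, or maps would need a recursive variant.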
was:
When a binary column is written to a CSV file, the actual content of the column is {*}object.toString(){*}, which is meaningless.
{code:java}
val df = Seq(Array[Byte](1,2)).toDF
df.write.csv("/Users/guowei19/Desktop/binary_csv")
{code}
The csv file's content is as follows:
!image-2023-01-30-17-21-09-212.png|width=141,height=29!
Meanwhile, if a binary column is saved as a table with the CSV file format, the table cannot be read back successfully.
{code:java}
val df = Seq((1, Array[Byte](1,2))).toDF
df.write.format("csv").saveAsTable("binaryDataTable")
spark.sql("select * from binaryDataTable").show()
{code}
!https://rte.weiyun.baidu.com/wiki/attach/image/api/imageDownloadAddress?attachId=82da0afc444c41bdaac34418a1c89963&docGuid=Eiscz4oMI45Sfp&sign=eyJhbGciOiJkaXIiLCJlbmMiOiJBMjU2R0NNIiwiYXBwSWQiOjEsInVpZCI6IjgtVWkzU0lMY2wiLCJkb2NJZCI6IkVpc2N6NG9NSTQ1U2ZwIn0..z1O-00hE1tTua9co.RmL0GxEQyNVQbIMYOvyAmQY18NMCxHdGdEPtulFiV3BuqsVlJODgA9-xFY9H9yer_Ckpbt4aG2ZrqgohIq43_ywzj-8u8SKKZnnzm7Dt-EhQBwrA7EhwUveE4-MRcAmsgqRKneN0gUJIu78ogR-M5-GAYqiyd-C-PH0LTaHDhNBWFBkF01kVOLJ18c2VTT6_lbc9j9Drmxj56ouymFgfhdUtpA.cTYqsEvvnKDcIPiah99f_A!
So I think it's better to make binary an unsupported data type in the CSV format, for both datasource v1 (CSVFileFormat) and v2 (CSVTable).
> change binary to unsupported dataType in csv format
> ---------------------------------------------------
>
> Key: SPARK-42237
> URL: https://issues.apache.org/jira/browse/SPARK-42237
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.4.8, 3.3.1
> Reporter: Wei Guo
> Priority: Minor
> Attachments: image-2023-01-30-17-21-09-212.png
--
This message was sent by Atlassian Jira
(v8.20.10#820010)