Posted to issues@spark.apache.org by "Ivan Sadikov (Jira)" <ji...@apache.org> on 2022/10/14 03:37:00 UTC

[jira] [Commented] (SPARK-40637) DataFrame can correctly encode BINARY type but SparkSQL cannot

    [ https://issues.apache.org/jira/browse/SPARK-40637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617435#comment-17617435 ] 

Ivan Sadikov commented on SPARK-40637:
--------------------------------------

You are not writing to the table in the first example, but you are in the second, which could be an ORC bug. Could you try running the same code in both Scala and SQL? Also, I don't see any output for the second command.
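
For reference, a cross-check along those lines could look like the following (a sketch, not verified here; it assumes the {{binary_vals}} table from the repro below already exists):

{code:java}
scala> // Sketch: read the table created by the spark-sql repro back in spark-shell
scala> spark.table("binary_vals").show(false)

scala> // Print the raw bytes to see whether the stored value is actually empty or null
scala> spark.table("binary_vals").collect().foreach(r => println(Option(r.getAs[Array[Byte]]("c1")).map(_.mkString("[", " ", "]")).getOrElse("null")))
{code}

If the bytes come back as {{[1]}} here but the CLI prints nothing, the problem would be in how {{spark-sql}} renders BINARY rather than in what ORC stores.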

> DataFrame can correctly encode BINARY type but SparkSQL cannot
> --------------------------------------------------------------
>
>                 Key: SPARK-40637
>                 URL: https://issues.apache.org/jira/browse/SPARK-40637
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> Storing a BINARY value (e.g. {{BigInt("1").toByteArray}} / {{X'01'}}) via {{spark-shell}} outputs {{[01]}}. However, the same value is not encoded correctly when it is inserted into a BINARY column of a table via {{spark-sql}}.
> h3. To Reproduce
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}}:
> {code:java}
> $SPARK_HOME/bin/spark-shell
> {code}
> Execute the following:
> {code:java}
> scala> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Row
>
> scala> import org.apache.spark.sql.types._
> import org.apache.spark.sql.types._
>
> scala> val rdd = sc.parallelize(Seq(Row(BigInt("1").toByteArray)))
> rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[356] at parallelize at <console>:28
> scala> val schema = new StructType().add(StructField("c1", BinaryType, true))
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,BinaryType,true))
> scala> val df = spark.createDataFrame(rdd, schema)
> df: org.apache.spark.sql.DataFrame = [c1: binary]
> scala> df.show(false)
> +----+
> |c1  |
> +----+
> |[01]|
> +----+
> {code}
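> The snippet above never persists the DataFrame to a table. A minimal sketch of that extra step, using a hypothetical table name {{binary_vals_shell}}, makes for an apples-to-apples comparison with the {{spark-sql}} repro below:
> {code:java}
> scala> // Sketch: write the same DataFrame through ORC and read it back
> scala> df.write.format("orc").saveAsTable("binary_vals_shell")
>
> scala> spark.table("binary_vals_shell").show(false)
> {code}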
>  
> Using {{spark-sql}}:
> {code:java}
> $SPARK_HOME/bin/spark-sql
> {code}
> Execute the following; we only get an empty output even though one row is fetched:
> {code:java}
> spark-sql> create table binary_vals(c1 BINARY) stored as ORC;
> spark-sql> insert into binary_vals select X'01';
> spark-sql> select * from binary_vals;
> Time taken: 0.077 seconds, Fetched 1 row(s)
> {code}
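> One way to check whether the stored value is actually empty or only rendered as blank (a sketch using built-in SQL functions):
> {code:java}
> spark-sql> -- Inspect the stored bytes instead of relying on the raw rendering
> spark-sql> select hex(c1), length(c1), c1 is null from binary_vals;
> {code}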
> h3. Expected behavior
> We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type ({{BINARY}}) & input ({{BigInt("1").toByteArray}} / {{X'01'}}) combination.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org