Posted to issues@spark.apache.org by "Philip Adetiloye (Jira)" <ji...@apache.org> on 2023/04/19 18:59:00 UTC

[jira] [Updated] (SPARK-43201) Inconsistency between from_avro and from_json function

     [ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip Adetiloye updated SPARK-43201:
-------------------------------------
    Description: 
Spark's from_avro function does not accept the schema from a DataFrame column; it only takes a String schema:
{code:java}
def from_avro(col: Column, jsonFormatSchema: String): Column {code}
This makes it impossible to deserialize rows of Avro records with different schemas, since only a single schema string can be passed in externally. 

 

Here is what I would expect:
{code:java}
def from_avro(col: Column, jsonFormatSchema: Column): Column  {code}
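For comparison, from_json already offers an overload whose schema argument is a Column, which is the inconsistency the title refers to. A minimal sketch of that existing from_json behaviour (assuming a SparkSession named {{spark}} is in scope; the sample JSON values are illustrative only):
{code:java}
import org.apache.spark.sql.functions.{from_json, schema_of_json, lit}
import spark.implicits._

val jsonDf = Seq("""{"str1":"apple1","str2":"one"}""").toDF("jsonData")

// The second argument is a Column, so the schema is not fixed to a single
// externally supplied string the way it is for from_avro.
val parsedJson = jsonDf.select(
  from_json($"jsonData", schema_of_json(lit("""{"str1":"a","str2":"b"}"""))).as("parsedData"))
parsedJson.show()
{code}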
Code example:
{code:java}
import org.apache.spark.sql.avro.functions.from_avro
import spark.implicits._  // for toDF and the $ column syntax

val avroSchema1 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" 

val avroSchema2 = """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}"""


val df = Seq(
  (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1),
  (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2)
).toDF("binaryData", "schema")


val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData"))
parsed.show()


// Output:
// +-------------+
// |   parsedData|
// +-------------+
// |[apple1, 1.0]|
// |[apple2, 2.0]|
// +-------------+
 {code}
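Until such an overload exists, one possible workaround with the current String-based signature is to collect the distinct schema strings, apply from_avro once per schema, and union the results. This is only a rough sketch under the assumptions of the example above (the {{df}} with {{binaryData}} and {{schema}} columns, and a SparkSession named {{spark}} in scope):
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.avro.functions.from_avro
import spark.implicits._

// One pass per distinct schema string, then union the parsed slices.
val schemas: Array[String] = df.select("schema").distinct().as[String].collect()

val parsedPerSchema: Seq[DataFrame] = schemas.toSeq.map { s =>
  df.filter($"schema" === s)
    .select(from_avro($"binaryData", s).as("parsedData"))
}
val parsed = parsedPerSchema.reduce(_ unionByName _)
{code}
This only works when every schema yields a compatible struct type for {{parsedData}}; truly heterogeneous schemas would still need separate output DataFrames, which is why a Column-based overload would be preferable.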
 

> Inconsistency between from_avro and from_json function
> ------------------------------------------------------
>
>                 Key: SPARK-43201
>                 URL: https://issues.apache.org/jira/browse/SPARK-43201
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.4.0
>            Reporter: Philip Adetiloye
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
