Posted to issues@spark.apache.org by "Dongjoon Hyun (JIRA)" <ji...@apache.org> on 2019/07/16 16:42:13 UTC

[jira] [Updated] (SPARK-24855) Built-in AVRO support should support specified schema on write

     [ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24855:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                       3.0.0

> Built-in AVRO support should support specified schema on write
> --------------------------------------------------------------
>
>                 Key: SPARK-24855
>                 URL: https://issues.apache.org/jira/browse/SPARK-24855
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Brian Lindblom
>            Assignee: Brian Lindblom
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> spark-avro appears to have been brought in from an upstream project, https://github.com/databricks/spark-avro. I opened a PR a while ago to enable support for 'forceSchema', which allows us to specify an AVRO schema with which to write our records, to handle some use cases we have. That code was never merged, but I would like to add this feature to the AVRO reader/writer code that was brought in. The original PR is here, and I will follow up with a more formal PR/patch rebased on the Spark master branch: https://github.com/databricks/spark-avro/pull/222
>  
> This change allows us to specify a schema, which should be compatible with the schema spark-avro generates from the dataset definition. With it, a user can specify default values, change union ordering, or preserve an original schema in the output container files; for example, when reading in an AVRO data set, doing some in-line field cleansing, and then writing back out with the original schema. I've had several use cases where this behavior was desired, and there were several other asks for it in the spark-avro project.
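For illustration, usage of the proposed option might look like the sketch below. This is not from the PR itself: the option name 'forceSchema' is taken from the description above (Spark's built-in Avro source later exposed a similar option named "avroSchema"), and the schema string and output path are made-up examples showing a default value and an explicit union ordering. Running it requires a Spark runtime with the Avro data source on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object SchemaOnWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("avro-schema-on-write")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical target schema: declares a default for the optional field
    // and pins the union ordering, independent of what spark-avro would
    // generate from the DataFrame's own schema.
    val avroSchema =
      """{
        |  "type": "record", "name": "User",
        |  "fields": [
        |    {"name": "name", "type": "string"},
        |    {"name": "nickname", "type": ["null", "string"], "default": null}
        |  ]
        |}""".stripMargin

    val df = Seq(("alice", Some("al")), ("bob", None)).toDF("name", "nickname")

    df.write
      .format("avro")
      // Option name as proposed in the PR; the schema must be compatible
      // with the one spark-avro would derive from the DataFrame.
      .option("forceSchema", avroSchema)
      .save("/tmp/users-avro")

    spark.stop()
  }
}
```

The write succeeds only when the supplied schema resolves against the generated one under Avro's schema-resolution rules, which is what makes the read-cleanse-rewrite round trip above preserve the original container-file schema.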



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org