Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/02/10 10:33:00 UTC

[jira] [Assigned] (SPARK-34416) Support avroSchemaUrl in addition to avroSchema

     [ https://issues.apache.org/jira/browse/SPARK-34416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34416:
------------------------------------

    Assignee: Apache Spark

> Support avroSchemaUrl in addition to avroSchema
> -----------------------------------------------
>
>                 Key: SPARK-34416
>                 URL: https://issues.apache.org/jira/browse/SPARK-34416
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0, 2.4.0, 3.2.0
>            Reporter: Ohad Raviv
>            Assignee: Apache Spark
>            Priority: Minor
>
> We have a use case in which we read a huge table in Avro format, with about 30k columns.
> Using the default Hive reader, `AvroGenericRecordReader`, the job just hangs forever; after 4 hours not even one task had finished.
> We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:
> ```
> org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
> ..
> at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
>  at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
>  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
>  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
>  ... 53 elided
> ```
>  
> because the files' schema contains duplicate column names (when compared case-insensitively).
> So we wanted to provide a user schema with de-duplicated fields, but the schema is huge, a few MBs, so it is not practical to pass it inline as a JSON string (see the sketch below).
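>  
> For illustration, a minimal sketch of that workaround with the existing `avroSchema` option (the schema path and table path here are hypothetical); the whole multi-MB schema JSON has to be materialized and passed inline as an option value:
> ```
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder().getOrCreate()
>
> // Read the (huge) Avro schema JSON from a local file and pass it inline.
> // Impractical in our case because the schema is a few MBs of JSON.
> val avroSchemaJson: String = scala.io.Source.fromFile("/tmp/huge_table.avsc").mkString
>
> val df = spark.read
>   .format("com.databricks.spark.avro")   // built-in "avro" format on Spark 2.4+
>   .option("avroSchema", avroSchemaJson)  // existing option: schema passed as a JSON string
>   .load("/path/to/huge_table")
> ```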
>  
> So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
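>  
> A rough sketch of how the patched reader is used; the `avroSchemaUrl` option is the one added by our patch, and the HDFS path is hypothetical:
> ```
> // Point the reader at a schema file (e.g. on HDFS) instead of inlining
> // the multi-MB schema JSON as an option value.
> val df = spark.read
>   .format("com.databricks.spark.avro")                         // built-in "avro" format on Spark 2.4+
>   .option("avroSchemaUrl", "hdfs:///schemas/huge_table.avsc")  // option added by the patch
>   .load("/path/to/huge_table")
> ```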



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org