Posted to issues@spark.apache.org by "Sebastian Herold (Jira)" <ji...@apache.org> on 2020/09/15 20:57:00 UTC

[jira] [Created] (SPARK-32895) DataSourceV2 allow ACCEPT_ANY_SCHEMA in write path

Sebastian Herold created SPARK-32895:
----------------------------------------

             Summary: DataSourceV2 allow ACCEPT_ANY_SCHEMA in write path
                 Key: SPARK-32895
                 URL: https://issues.apache.org/jira/browse/SPARK-32895
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Sebastian Herold


During the development of a Spark-Collibra connector using the DataSourceV2 framework, I found a blocking limitation in the current version.

The connector should accept DataFrames with arbitrary schemas and send them to the Import API of Collibra. The problem is the method {{inferSchema}} of the {{TableProvider}}: although my {{Table}} implementation has the {{ACCEPT_ANY_SCHEMA}} capability, I am forced to infer a schema without knowing the actual schema of the data frame, which is impossible. This behaviour may be intended when writing to an existing table with a fixed schema, but not when the sink accepts any schema; such cases simply cannot be implemented right now.

I found in [{{DataFrameWriter.scala}}|https://github.com/apache/spark/blob/4fac6d501a5d97530edb712ff3450890ac10e413/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L333] that data sources inheriting from {{FileDataSourceV2}} are treated as an exception: {{inferSchema}} is not called on the write path, and {{getTable}} is called with the schema of the actual data frame instead. This is why it works for data sources derived from {{FileDataSourceV2}}, and I would expect similar behaviour for my data source, which has the capability to accept any schema. The problem is that the capabilities are reported by the {{Table}} implementation, but to obtain a {{Table}} via {{getTable}} you already need a schema. I guess the interface should be designed differently (a rough sketch follows the list):
* two different methods to infer the schema:
** one for the read path, like the current implementation
** one for the write path, which receives the actual schema of the data frame as a parameter; this allows the implementation to decide:
*** Do I accept all schemas and just return the schema of the data frame?
*** Do I know the schema of the target and ignore the schema of the actual data frame?
*** Can the target schema be evolved, so that I check whether the schema of the data frame is a valid evolution of the target schema?
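
To make the idea concrete, here is a rough sketch of what such an interface could look like. The trait name {{TableProviderWithWriteSchema}} and the method {{inferWriteSchema}} are only illustrative suggestions, not existing Spark API; the other signatures mirror the current {{TableProvider}}:

{code:scala}
// Rough sketch only — inferWriteSchema is a hypothetical name, not existing Spark API.
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

trait TableProviderWithWriteSchema {

  /** Read path: unchanged, infer the schema from the source itself. */
  def inferSchema(options: CaseInsensitiveStringMap): StructType

  /**
   * Write path (hypothetical): receives the schema of the data frame being
   * written, so the implementation can accept it as-is, ignore it in favour
   * of a known target schema, or validate it as an evolution of the target.
   */
  def inferWriteSchema(
      options: CaseInsensitiveStringMap,
      dataSchema: StructType): StructType

  /** Same signature as in the current TableProvider. */
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: java.util.Map[String, String]): Table
}
{code}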

If you agree, I'm willing to make a PR.
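
For illustration, the connector described above could then implement the write path like this (building on the hypothetical trait sketched earlier; the class and table names are made up):

{code:scala}
// Hypothetical implementation of an accept-any-schema sink; names are illustrative.
import java.util
import org.apache.spark.sql.connector.catalog.{Table, TableCapability}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class CollibraImportProvider extends TableProviderWithWriteSchema {

  // Read path: this source is write-only, so there is nothing to infer.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    throw new UnsupportedOperationException("This data source is write-only")

  // Write path: the sink accepts any schema, so simply echo the data frame's
  // schema back. A fixed-schema target would return its known schema here
  // instead, optionally validating dataSchema against it.
  override def inferWriteSchema(
      options: CaseInsensitiveStringMap,
      dataSchema: StructType): StructType = dataSchema

  override def getTable(
      tableSchema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = new Table {
    override def name(): String = "collibra_import"
    override def schema(): StructType = tableSchema
    // A real sink would also extend SupportsWrite to actually build writes.
    override def capabilities(): util.Set[TableCapability] =
      util.EnumSet.of(TableCapability.ACCEPT_ANY_SCHEMA, TableCapability.BATCH_WRITE)
  }
}
{code}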



