Posted to issues@phoenix.apache.org by "Istvan Toth (Jira)" <ji...@apache.org> on 2022/11/29 13:08:00 UTC

[jira] [Commented] (PHOENIX-6667) Spark3 connector requires that all columns are specified when writing

    [ https://issues.apache.org/jira/browse/PHOENIX-6667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17640694#comment-17640694 ] 

Istvan Toth commented on PHOENIX-6667:
--------------------------------------

Replying to an offline question from Spark folks on why this is a problem:


One issue is simply backwards compatibility: this used to work with Spark 2.
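For illustration, here is a minimal sketch of the kind of partial-column write the Spark 2 connector accepted. The table name, columns and ZooKeeper URL are made up, and the exact format/option names vary between connector versions, so treat it as a sketch rather than a reference:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

object PartialWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("phoenix-partial-write").getOrCreate()
    import spark.implicits._

    // OUTPUT_TABLE is assumed to have more columns (e.g. ID, COL1, COL2, COL3);
    // the DataFrame deliberately carries only ID and COL1.
    val df = Seq((1L, "foo"), (2L, "bar")).toDF("ID", "COL1")

    // With the Spark 2 connector this upserted just the supplied columns and left
    // the rest null/defaulted; Spark3's schema validation rejects the same write.
    df.write
      .format("phoenix")                  // "org.apache.phoenix.spark" with the older API
      .option("table", "OUTPUT_TABLE")
      .option("zkUrl", "localhost:2181")  // ZooKeeper quorum, cluster-specific
      .mode(SaveMode.Overwrite)
      .save()
  }
}
{code}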

Another issue is semantics:
SQL tables have both nullable fields and fields with defaults, neither of which needs to be specified when adding records.
Sparse rows are normal, especially in big data workloads.
The Spark3 behaviour breaks these conventions by requiring a value to be specified for every column.
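A small JDBC sketch of what that means in SQL terms (the table name, JDBC URL and the use of a DEFAULT clause are assumptions, not taken from the ticket): COL2 is nullable and COL3 has a default, so an UPSERT only needs to name the columns it actually provides.

{code:scala}
import java.sql.DriverManager

object SparseUpsertExample {
  def main(args: Array[String]): Unit = {
    // Hypothetical Phoenix JDBC URL; requires the Phoenix client on the classpath.
    val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
    val stmt = conn.createStatement()

    stmt.execute(
      """CREATE TABLE IF NOT EXISTS SPARSE_DEMO (
        |  ID BIGINT NOT NULL PRIMARY KEY,
        |  COL1 VARCHAR,
        |  COL2 VARCHAR,
        |  COL3 INTEGER DEFAULT 0)""".stripMargin)

    // Only ID and COL1 are specified; COL2 stays null and COL3 takes its default.
    stmt.executeUpdate("UPSERT INTO SPARSE_DEMO (ID, COL1) VALUES (1, 'foo')")
    conn.commit()
    conn.close()
  }
}
{code}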

Also, Phoenix uses upserts instead of updates, so the normal way to update values is to specify only the primary key fields and the changed fields.
If you have to specify all fields, you either lose this capability or you have to read back the full row for every row you change first, which kills performance and can introduce races.
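Continuing with the same made-up table, this is the update idiom described above: an UPSERT naming only the primary key and the changed column updates that column in place, with no prior read of the row.

{code:scala}
import java.sql.DriverManager

object PartialUpdateExample {
  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
    val stmt = conn.createStatement()

    // Updates COL1 of row 1; COL2 and COL3 keep whatever values they already have.
    stmt.executeUpdate("UPSERT INTO SPARSE_DEMO (ID, COL1) VALUES (1, 'updated')")
    conn.commit()

    // If every column had to be supplied, the application would first have to
    // SELECT the full row, merge the change, and write everything back: that is
    // the extra read and the race window mentioned above.
    conn.close()
  }
}
{code}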

The third issue is performance:
Requiring that every column is specified increases the amount of data that has to be processed by the application, Spark, the connector, and Phoenix/HBase.
This costs memory, CPU and network resources, increases GC pressure, etc.

> Spark3 connector requires that all columns are specified when writing
> ---------------------------------------------------------------------
>
>                 Key: PHOENIX-6667
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6667
>             Project: Phoenix
>          Issue Type: Bug
>          Components: connectors, spark-connector
>    Affects Versions: connectors-6.0.0
>            Reporter: Istvan Toth
>            Priority: Major
>
> For Spark 2, it was possible to omit some columns from the DataFrame, the same way it is not mandatory to specify all columns when upserting via SQL.
> Spark3 has added new checks, which require that EVERY SQL column is specified in the DataFrame.
> Consequently, when using the current API, writing will fail unless you specify all columns.
> This is a loss of functionality with respect to Phoenix (and other SQL datastores) compared to Spark2.
> I don't think we can do anything from the Phoenix side; I'm just documenting the regression here.
> Maybe future Spark versions will make this configurable.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)