Posted to issues@spark.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/08/22 15:58:00 UTC
[jira] [Commented] (SPARK-25187) Revisit the life cycle of ReadSupport instances.

    [ https://issues.apache.org/jira/browse/SPARK-25187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589060#comment-16589060 ] 

Ryan Blue commented on SPARK-25187:
-----------------------------------

The need for {{newScanConfigBuilder}} to take key-value options doesn't require a change to the life-cycle of {{ReadSupport}} instances. Some options configure an individual scan rather than the source itself. If data sources are free to reuse {{ReadSupport}} instances, then scan options must be passed in when each scan is configured.

HBase provides a good example of the difference. HBase table options would include where the data lives, like the HBase host to connect to. HBase scan options would include the MVCC timestamp to request for a scan. An HBase {{ReadSupport}} can be reused, which means that the MVCC timestamp used should be the one passed to the scan, not one passed when creating the {{ReadSupport}}.
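
To make the distinction concrete, here is a minimal sketch in plain Java. It does not use the Spark interfaces: {{HBaseSourceSketch}} and {{ScanSketch}} are hypothetical stand-ins for a reusable {{ReadSupport}} and a scan config, and the option keys {{host}} and {{mvcc-timestamp}} are made up for illustration. The point is only that one source instance, built from table options, can serve several scans that each carry their own timestamp.

```java
import java.util.Collections;
import java.util.Map;

public class ScanOptionsSketch {

    /** Hypothetical stand-in for a reusable ReadSupport; NOT the Spark interface. */
    static final class HBaseSourceSketch {
        // Table/source option: where the data lives. Fixed for the instance.
        private final String host;

        HBaseSourceSketch(Map<String, String> tableOptions) {
            this.host = tableOptions.get("host"); // "host" is an assumed key
        }

        /** Scan options arrive per scan, so one instance can serve many scans. */
        ScanSketch newScan(Map<String, String> scanOptions) {
            // "mvcc-timestamp" is an assumed key for the scan-level option
            long timestamp = Long.parseLong(scanOptions.get("mvcc-timestamp"));
            return new ScanSketch(host, timestamp);
        }
    }

    /** Immutable description of a single scan. */
    static final class ScanSketch {
        final String host;
        final long mvccTimestamp;

        ScanSketch(String host, long mvccTimestamp) {
            this.host = host;
            this.mvccTimestamp = mvccTimestamp;
        }
    }

    public static void main(String[] args) {
        HBaseSourceSketch source =
            new HBaseSourceSketch(Collections.singletonMap("host", "hbase.example.com"));

        // The same source instance is reused; only the scan options differ.
        ScanSketch first = source.newScan(Collections.singletonMap("mvcc-timestamp", "100"));
        ScanSketch second = source.newScan(Collections.singletonMap("mvcc-timestamp", "200"));

        System.out.println(first.host + " @ " + first.mvccTimestamp);
        System.out.println(second.host + " @ " + second.mvccTimestamp);
    }
}
```

If the timestamp were captured in the constructor instead, the second scan could not request a different snapshot without creating a new source instance.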

I understand that this is a little confusing because right now both sets of options are mixed together. The only way to set these options is to pass them to the {{DataFrameReader}}. That makes it appear that there is only one set of options for a source. But consider sources that are stored in the session catalog. Those sources are stored with source/table configuration, the {{OPTIONS}} passed in when creating the table. When reading these tables, we can also pass options to the {{DataFrameReader}}, and those options need to be passed when creating a scan of those sources.
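
The catalog case can be sketched the same way. The class below is a hypothetical toy catalog, not Spark's session catalog or any Spark API: it keeps the {{OPTIONS}} given at table creation and, at read time, receives a second map standing in for the options set on the {{DataFrameReader}}. The option keys are again assumptions. The two maps arrive at different times and stay separate.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CatalogScanSketch {

    /** Hypothetical catalog: table name -> OPTIONS stored when the table was created. */
    private final Map<String, Map<String, String>> storedTableOptions = new HashMap<>();

    /** Stores source/table configuration once, at table creation. */
    void createTable(String name, Map<String, String> options) {
        storedTableOptions.put(name, new HashMap<>(options));
    }

    /**
     * Describes a scan of a catalog table. Stored options configure the source;
     * readerOptions (supplied at read time) configure only this scan.
     */
    String planScan(String name, Map<String, String> readerOptions) {
        Map<String, String> tableOptions = storedTableOptions.get(name);
        if (tableOptions == null) {
            throw new IllegalArgumentException("Unknown table: " + name);
        }
        // TreeMap gives a stable key order in the printed description.
        return "source=" + new TreeMap<>(tableOptions)
            + " scan=" + new TreeMap<>(readerOptions);
    }

    public static void main(String[] args) {
        CatalogScanSketch catalog = new CatalogScanSketch();
        catalog.createTable("events", Collections.singletonMap("host", "hbase.example.com"));

        // Read-time options never overwrite or mix with the stored table options.
        System.out.println(
            catalog.planScan("events", Collections.singletonMap("mvcc-timestamp", "100")));
    }
}
```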

> Revisit the life cycle of ReadSupport instances.
> ------------------------------------------------
>
>                 Key: SPARK-25187
>                 URL: https://issues.apache.org/jira/browse/SPARK-25187
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> Currently the life cycle is bound to the batch/stream query. This fits streaming very well but may not be perfect for batch sources. We can also consider letting {{ReadSupport.newScanConfigBuilder}} take {{DataSourceOptions}} as a parameter, if we decide to change the life cycle.


