Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/01 20:25:59 UTC

[jira] [Updated] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

     [ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16963:
--------------------------------
    Fix Version/s:     (was: 2.0.3)
                   2.0.2

> Change Source API so that sources do not need to keep unbounded state
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16963
>                 URL: https://issues.apache.org/jira/browse/SPARK-16963
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Streaming
>    Affects Versions: 2.0.0, 2.0.1
>            Reporter: Frederick Reiss
>            Assignee: Frederick Reiss
>             Fix For: 2.0.2, 2.1.0
>
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() method for fetching records from the source, with the following Scaladoc comments defining the semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When `start` is `None` then
>  * the batch should begin with the first available record. This method must always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain all past history for the stream that it backs. Further, a Source is also required to retain this data across restarts of the process where the Source is instantiated, even when the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] for more information.
> This JIRA will cover augmenting the Source API with an additional callback that will allow the Structured Streaming scheduler to notify the source when it is safe to discard buffered data.
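
As a rough illustration of the proposal above, here is a minimal sketch of a source that buffers records only until the scheduler tells it they are no longer needed. It does not use Spark's real types: `LongOffset`, `SketchSource`, `BufferingSource`, `append`, and `bufferedCount` are hypothetical names introduced for this example, batches are plain `Seq[String]` values instead of DataFrames, and the callback name `commit` is an assumption, since the issue text does not name the callback.

```scala
import scala.collection.immutable.SortedMap

// Simplified stand-in for an offset type (an assumption for illustration, not Spark's class).
final case class LongOffset(value: Long)

// Hypothetical, cut-down version of the Source contract described in the issue.
trait SketchSource {
  /** Returns the records in (start, end]; None means "from the first available record". */
  def getBatch(start: Option[LongOffset], end: LongOffset): Seq[String]

  /** Assumed callback: the scheduler promises not to request data at or before `end` again. */
  def commit(end: LongOffset): Unit
}

final class BufferingSource extends SketchSource {
  // Buffered records keyed by offset, kept in offset order. Offsets start at 1.
  private var buffer = SortedMap.empty[Long, String]
  private var lastOffset = 0L

  /** Adds a record to the buffer and returns the offset assigned to it. */
  def append(record: String): LongOffset = {
    lastOffset += 1
    buffer = buffer + (lastOffset -> record)
    LongOffset(lastOffset)
  }

  override def getBatch(start: Option[LongOffset], end: LongOffset): Seq[String] = {
    // `start` is exclusive and `end` is inclusive, matching the (start, end] contract.
    val lowerExclusive = start.map(_.value).getOrElse(Long.MinValue)
    buffer.iterator
      .filter { case (offset, _) => offset > lowerExclusive && offset <= end.value }
      .map { case (_, record) => record }
      .toVector
  }

  override def commit(end: LongOffset): Unit = {
    // Keep only records after the committed offset; everything older can be discarded.
    buffer = buffer.filter { case (offset, _) => offset > end.value }
  }

  /** Number of records currently held in memory. */
  def bufferedCount: Int = buffer.size
}
```

Without the `commit` call the buffer in this sketch grows for as long as the stream runs, which is the unbounded-state problem the issue describes. With it, the source only needs to hold records between the last committed offset and the newest one.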



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
