Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/11/01 20:25:59 UTC
[jira] [Updated] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state
[ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Reynold Xin updated SPARK-16963:
--------------------------------
Fix Version/s: 2.0.2 (was: 2.0.3)
> Change Source API so that sources do not need to keep unbounded state
> ---------------------------------------------------------------------
>
> Key: SPARK-16963
> URL: https://issues.apache.org/jira/browse/SPARK-16963
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Affects Versions: 2.0.0, 2.0.1
> Reporter: Frederick Reiss
> Assignee: Frederick Reiss
> Fix For: 2.0.2, 2.1.0
>
>
> The version of the Source API in Spark 2.0.0 defines a single getBatch() method for fetching records from the source, with the following Scaladoc comments defining the semantics:
> {noformat}
> /**
> * Returns the data that is between the offsets (`start`, `end`]. When `start` is `None` then
> * the batch should begin with the first available record. This method must always return the
> * same data for a particular `start` and `end` pair.
> */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain the entire history of the stream that it backs, because any past (`start`, `end`] range may be requested again. Further, a Source must retain this data across restarts of the process in which it is instantiated, even when it is restarted on a different machine.
> These restrictions make the Source API difficult to implement, as any implementation requires potentially unbounded amounts of distributed storage.
> See the mailing list thread at [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] for more information.
> This JIRA will cover augmenting the Source API with an additional callback that will allow the Structured Streaming scheduler to notify the source when it is safe to discard buffered data.
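
The following is a minimal sketch of the idea described in the issue, not the change as it was implemented. It assumes a simplified source in which offsets are plain Long values and a batch is a Seq[String] rather than a DataFrame. The class name BufferedSource, the callback name commit, and the helpers append and bufferedCount are all hypothetical names chosen for illustration. The sketch keeps the getBatch contract (same data for the same `start` and `end` pair) for every range that has not yet been committed, and lets the caller signal when older records may be dropped.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a streaming source.
// Assumption: offsets are Longs and batches are Seq[String];
// the real Source API uses Offset and DataFrame instead.
final class BufferedSource {
  // Records in arrival order; buffer(i) has offset firstOffset + i.
  private val buffer = mutable.ArrayBuffer.empty[String]
  private var firstOffset = 0L // offset of buffer(0)

  /** Appends a record and returns the offset assigned to it. */
  def append(record: String): Long = {
    buffer += record
    firstOffset + buffer.length - 1
  }

  /** Returns the records with offsets in (start, end].
    * When start is None, the batch begins with the first available record.
    */
  def getBatch(start: Option[Long], end: Long): Seq[String] = {
    val from = start.map(_ + 1).getOrElse(firstOffset)
    require(from >= firstOffset, s"offset $from has already been discarded")
    require(end < firstOffset + buffer.length, s"offset $end has not arrived yet")
    buffer.slice((from - firstOffset).toInt, (end - firstOffset + 1).toInt).toList
  }

  /** Assumed callback: the scheduler promises never to request
    * offsets less than or equal to `end` again, so they can be dropped.
    */
  def commit(end: Long): Unit = {
    val wanted = math.max(end - firstOffset + 1, 0L)
    val drop = math.min(wanted, buffer.length.toLong).toInt
    buffer.remove(0, drop)
    firstOffset += drop
  }

  /** Number of records currently held in memory. */
  def bufferedCount: Int = buffer.length
}
```

Without a call to commit, the buffer in this sketch grows by one entry per appended record and never shrinks, which is the unbounded-state problem the issue describes. Once commit(n) has been called, the source holds only records with offsets greater than n, and a later getBatch that reaches back to a discarded offset fails fast instead of returning partial data.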