Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2016/08/09 02:39:20 UTC

[jira] [Commented] (SPARK-16963) Change Source API so that sources do not need to keep unbounded state

    [ https://issues.apache.org/jira/browse/SPARK-16963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15412840#comment-15412840 ] 

Apache Spark commented on SPARK-16963:
--------------------------------------

User 'frreiss' has created a pull request for this issue:
https://github.com/apache/spark/pull/14553

> Change Source API so that sources do not need to keep unbounded state
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16963
>                 URL: https://issues.apache.org/jira/browse/SPARK-16963
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 2.0.0
>            Reporter: Frederick Reiss
>
> The Source API in Spark 2.0.0 defines a single getBatch() method for fetching records from a source, with the following Scaladoc comment defining its semantics:
> {noformat}
> /**
>  * Returns the data that is between the offsets (`start`, `end`]. When `start` is `None` then
>  * the batch should begin with the first available record. This method must always return the
>  * same data for a particular `start` and `end` pair.
>  */
> def getBatch(start: Option[Offset], end: Offset): DataFrame
> {noformat}
> These semantics mean that a Source must retain the entire past history of the stream it backs. Further, a Source is required to retain this data across restarts of the process in which it is instantiated, even when the Source is restarted on a different machine.
> These restrictions make it difficult to implement the Source API, as any implementation requires potentially unbounded amounts of distributed storage.
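> For illustration (this sketch is not part of the JIRA), the contract above pushes an implementation toward something like the following. SimpleOffset and BufferingSource are hypothetical stand-ins for a real source's offset type and buffering logic, and a real source would additionally have to persist its buffer to survive restarts:
> {noformat}
> // Hypothetical stand-ins: a real Source returns DataFrames and uses the
> // Offset trait, but the buffering obligation is the same.
> case class SimpleOffset(value: Long)
>
> class BufferingSource[Record] {
>   // Every record ever received has to stay here, because the scheduler may
>   // re-request any (start, end] range, including after a restart.
>   private var buffer = Vector.empty[(Long, Record)]
>   private var latest = -1L
>
>   def append(r: Record): Unit = synchronized {
>     latest += 1
>     buffer :+= ((latest, r))
>   }
>
>   def getOffset: Option[SimpleOffset] = synchronized {
>     if (latest < 0) None else Some(SimpleOffset(latest))
>   }
>
>   // Must return the same data for a given (start, end) pair every time it is
>   // called, so nothing in the buffer can ever be dropped.
>   def getBatch(start: Option[SimpleOffset], end: SimpleOffset): Seq[Record] = synchronized {
>     val from = start.map(_.value + 1).getOrElse(0L)
>     buffer.collect { case (off, r) if off >= from && off <= end.value => r }
>   }
> }
> {noformat}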
> See the mailing list thread at [http://apache-spark-developers-list.1001551.n3.nabble.com/Source-API-requires-unbounded-distributed-storage-td18551.html] for more information.
> This JIRA will cover augmenting the Source API with an additional callback that will allow the Structured Streaming scheduler to notify the source when it is safe to discard buffered data.
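> One possible shape for such a callback, sketched against the hypothetical BufferingSource above rather than the real Source trait (the actual name and signature will come out of the linked pull request), is a commit-style notification:
> {noformat}
> // Hypothetical addition to the BufferingSource sketch above; not the final API.
> // The scheduler would call this once everything up to `end` has been durably
> // committed, so the source may safely discard that prefix of its buffer.
> def commit(end: SimpleOffset): Unit = synchronized {
>   buffer = buffer.filter { case (off, _) => off > end.value }
> }
> {noformat}
> With such a notification, the state a source must keep is bounded by the data of the batches that have not yet been committed, rather than by the full history of the stream.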



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org