Posted to dev@spark.apache.org by "Thakrar, Jayesh" <jt...@conversantmedia.com> on 2018/04/06 15:29:30 UTC

Spark 2.3 V2 Datasource API questions

First of all, thank you to the Spark dev team for coming up with standardized and intuitive API interfaces.
I am sure they will encourage many more new datasource integrations.

I have been playing with the API and have some questions about the continuous streaming API
(see https://github.com/JThakrar/sparkconn#continuous-streaming-datasource )

It seems that "commit" is never called.

query.status always shows the message below, even after the query has been initialized and data has been streaming:
{
  "message" : "Initializing sources",
  "isDataAvailable" : false,
  "isTriggerActive" : true
}


query.recentProgress always shows an empty array:

Array[org.apache.spark.sql.streaming.StreamingQueryProgress] = Array()

And stopping a query always makes it look as if the tasks were lost involuntarily or uncleanly (even though close on the datasource was called):
2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 2.0 in stage 1.0 (TID 7, localhost, executor driver): TaskKilled (Stage cancelled)
2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 1.0 in stage 1.0 (TID 6, localhost, executor driver): TaskKilled (Stage cancelled)
2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 3.0 in stage 1.0 (TID 8, localhost, executor driver): TaskKilled (Stage cancelled)
2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 0.0 in stage 1.0 (TID 5, localhost, executor driver): TaskKilled (Stage cancelled)
2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 4.0 in stage 1.0 (TID 9, localhost, executor driver): TaskKilled (Stage cancelled)

Any pointers/info will be greatly appreciated.




Re: Spark 2.3 V2 Datasource API questions

Posted by "Thakrar, Jayesh" <jt...@conversantmedia.com>.
Thank you Jose for the quick reply!
I have added myself as a watcher on those issues.

From: Joseph Torres <jo...@databricks.com>
Date: Friday, April 6, 2018 at 10:41 AM
To: "Thakrar, Jayesh" <jt...@conversantmedia.com>
Cc: "dev@spark.apache.org" <de...@spark.apache.org>
Subject: Re: Spark 2.3 V2 Datasource API questions

Thanks for trying it out!

We haven't hooked continuous streaming up to query.status or query.recentProgress yet - commit() should be called under the hood; we just don't yet report that it is. I've filed SPARK-23886 and SPARK-23887 to track the work to add those things.

The issue with printing warnings whenever the query is stopped is tracked in SPARK-23444.

Jose






Re: Spark 2.3 V2 Datasource API questions

Posted by Joseph Torres <jo...@databricks.com>.
Thanks for trying it out!

We haven't hooked continuous streaming up to query.status or
query.recentProgress yet - commit() should be called under the hood; we
just don't yet report that it is. I've filed SPARK-23886 and SPARK-23887 to
track the work to add those things.

The issue with printing warnings whenever the query is stopped is tracked
in SPARK-23444.

Jose
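
For anyone who wants to tell these stop-time warnings apart from real task loss while SPARK-23444 is open, here is a minimal sketch that filters driver log lines. It assumes the TaskSetManager line format shown in the original message; the function name split_lost_tasks and the regular expression are illustrative only and are not part of Spark.

```python
import re

# Matches the "Lost task" warning format shown in the original message.
# Assumption: other Spark versions or log4j layouts may format this differently.
LOST_TASK = re.compile(
    r"WARN\s+TaskSetManager:\d+ - Lost task (?P<task>\d+\.\d+) in stage "
    r"(?P<stage>\d+\.\d+) \(TID (?P<tid>\d+), [^)]*\): (?P<reason>.+)$"
)

# Reason string that appears when the query is stopped, as seen in the thread.
STOP_REASON = "TaskKilled (Stage cancelled)"


def split_lost_tasks(lines):
    """Return (cancelled, other) lists of (tid, reason) tuples.

    cancelled: lost tasks whose reason is stage cancellation.
    other:     lost tasks with any other reason, which may deserve a closer look.
    """
    cancelled, other = [], []
    for line in lines:
        match = LOST_TASK.search(line)
        if match is None:
            continue  # not a lost-task warning
        entry = (int(match.group("tid")), match.group("reason").strip())
        if entry[1] == STOP_REASON:
            cancelled.append(entry)
        else:
            other.append(entry)
    return cancelled, other


sample = [
    "2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 2.0 in stage 1.0 "
    "(TID 7, localhost, executor driver): TaskKilled (Stage cancelled)",
    "2018-04-06 08:07:10 WARN  TaskSetManager:66 - Lost task 1.0 in stage 1.0 "
    "(TID 6, localhost, executor driver): TaskKilled (Stage cancelled)",
]
cancelled, other = split_lost_tasks(sample)
print(len(cancelled), "cancelled on stop;", len(other), "lost for other reasons")
```

If the second list is empty after stopping a query, every lost-task warning in the log came from stage cancellation, which is the behaviour described above.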
