Posted to issues@spark.apache.org by "Jungtaek Lim (Jira)" <ji...@apache.org> on 2020/06/08 23:08:00 UTC

[jira] [Commented] (SPARK-31931) When using GCS as checkpoint location for Structured Streaming aggregation pipeline, the Spark writing job is aborted

    [ https://issues.apache.org/jira/browse/SPARK-31931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128705#comment-17128705 ] 

Jungtaek Lim commented on SPARK-31931:
--------------------------------------

Critical+ is reserved for committers. Lowering the priority.

The checkpoint mechanism relies on atomic rename by default, which may not be supported by object stores. S3 is a known unsupported store, and I'd guess GCS is another. I see supporting such stores as a "good to have" rather than an "essential" feature.
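To make the rename dependency concrete, here is a minimal sketch of the write-temp-file-then-rename commit pattern using the plain Hadoop FileSystem API. The paths, bucket name, and temp-file naming are hypothetical illustrations, not Spark's actual checkpoint layout:

import java.net.URI
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object RenameCommitSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical checkpoint directory; a gs:// URI resolves to the GCS connector's FileSystem.
    val checkpointDir = new URI("gs://my-bucket/checkpoints/query1")
    val fs = FileSystem.get(checkpointDir, new Configuration())

    val tmp = new Path(s"$checkpointDir/offsets/.42.tmp") // staging file
    val dst = new Path(s"$checkpointDir/offsets/42")      // committed name

    // Write the metadata to a temporary file first.
    val out = fs.create(tmp, false)
    out.write("""{"batchId":42}""".getBytes(StandardCharsets.UTF_8))
    out.close()

    // Commit by renaming. On HDFS this rename is atomic, so readers see either
    // the complete file or nothing. On some object stores rename is implemented
    // as copy+delete and is not atomic, which is where checkpointing can break.
    if (!fs.rename(tmp, dst)) {
      throw new IllegalStateException(s"Failed to commit $tmp to $dst")
    }
  }
}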

You may get a better answer from the user mailing list than by filing an issue here. Since you're using GCP, consulting Google might be your best bet.

> When using GCS as checkpoint location for Structured Streaming aggregation pipeline, the Spark writing job is aborted
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31931
>                 URL: https://issues.apache.org/jira/browse/SPARK-31931
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.5
>         Environment: GCP Dataproc 1.5 Debian 10 (Hadoop 2.10.0, Spark 2.4.5, Cloud Storage Connector hadoop2.2.1.3, Scala 2.12.10)
>            Reporter: Adrian Jones
>            Priority: Major
>         Attachments: spark-structured-streaming-error
>
>
> Structured Streaming checkpointing does not work with Google Cloud Storage when there are aggregations in the streaming pipeline.
> Using GCS as the external store works fine when there are no aggregations in the pipeline (e.g. groupBy); however, once an aggregation is introduced, the attached error is thrown.
> The error is only thrown when aggregating and pointing checkpointLocation at GCS. The exact same code works fine when checkpointLocation points at HDFS (a minimal sketch of this setup follows the quoted description).
> Is it expected for GCS to function as a checkpoint location for aggregated pipelines? Are efforts currently in progress to enable this? Is it on a roadmap?
>  
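For reference, a minimal sketch of the kind of pipeline described above. The rate source, grouping expression, and bucket path are hypothetical stand-ins for the reporter's actual job; the checkpointLocation option on GCS is the point of interest:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

object GcsCheckpointAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gcs-checkpoint-aggregation-sketch")
      .getOrCreate()
    import spark.implicits._

    // The built-in rate source stands in for whatever the real input stream is.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // The groupBy aggregation is what introduces state-store checkpointing.
    val counts = input
      .groupBy(($"value" % 10).as("bucket"))
      .agg(count("*").as("cnt"))

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      // Pointing this at GCS reportedly fails; an HDFS path reportedly works.
      .option("checkpointLocation", "gs://my-bucket/checkpoints/agg-query")
      .start()

    query.awaitTermination()
  }
}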



