Posted to user@spark.apache.org by "second_comet@yahoo.com.INVALID" <se...@yahoo.com.INVALID> on 2022/11/10 07:02:58 UTC

cannot write spark log to s3a

When running a Spark job, I used

     "spark.eventLog.dir": "s3a://_some_bucket_on_prem/spark-history",
     "spark.eventLog.enabled": true

I see the job log shows

22/11/10 06:42:30 INFO SingleEventLogFileWriter: Logging events to s3a://_some_bucket_on_prem/spark-history/spark-a2befd8cb9134190982a35663b61294b.inprogress
22/11/10 06:42:30 WARN S3ABlockOutputStream: Application invoked the Syncable API against stream writing to _some_bucket_on_prem/spark-history/a2befd8cb9134190982a35663b61294b.inprogress. This is unsupported

Does Spark 3.3.0 support writing the log to an s3a bucket? I can't write the log. It is an on-premises s3a. Am I missing any jar library? Does it support any cloud blob storage providers?


Re: cannot write spark log to s3a

Posted by Chris Nauroth <cn...@apache.org>.
The Spark event log writer expects to use a Hadoop-compatible file system
that supports the ability to sync [1] previously written data, immediately
making it durable and visible to other clients. The log message is warning
that the S3A file system does not provide this capability. Even if it's
asked to sync, the operation is a no-op. The data won't be visible until
the stream is closed.
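
To make that concrete, here is a minimal sketch (mine, not the event log
writer's actual code) of the Syncable calls it depends on; the bucket and
file name are placeholders:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("s3a://some-bucket/spark-history/app.inprogress")
    val fs = FileSystem.get(path.toUri, new Configuration())
    val out = fs.create(path, true)
    out.write("event json line\n".getBytes("UTF-8"))
    // On HDFS, hsync() makes the bytes durable and visible to readers now.
    // On S3A it is a no-op (hence the warning), and from Hadoop 3.3.1 an error.
    out.hsync()
    // Only close() triggers the actual upload that makes the object visible.
    out.close()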

To address this, you can switch spark.eventLog.dir to a file system that
does offer this capability, like HDFS. You could also ignore the warning,
but the consequences are that event log data won't be visible until the job
completes, and if it terminates unexpectedly before closing the stream, you
might not get any event data at all.
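
For example, a sketch with a placeholder namenode address and directory
(the same two properties also work in spark-defaults.conf or as
spark-submit --conf options):

    import org.apache.spark.sql.SparkSession

    // Point the event log at a file system with real sync support, e.g. HDFS.
    val spark = SparkSession.builder()
      .appName("event-log-on-hdfs")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "hdfs://namenode:8020/spark-history")
      .getOrCreate()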

Different cloud storage providers handle this differently. GCS supports an
option to enable a sync capability [2]. The implementation works by rolling
to a new hidden file when a sync is requested, and composing all such files
to present as a single stream to readers. The additional GCS API calls
required to do this mean that latencies will be longer as compared to HDFS.
Rate limiting can cause individual syncs to revert to no-ops too, meaning
the guarantee is not as strong as HDFS.
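
If I remember the connector option correctly (please check [2] for the
exact name and values in your connector version), enabling it from Spark
would look roughly like this, with a placeholder bucket:

    import org.apache.spark.SparkConf

    // Sketch: GCS event log directory plus the connector's syncable streams.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "gs://some-bucket/spark-history")
      .set("spark.hadoop.fs.gs.outputstream.type", "SYNCABLE_COMPOSITE")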

For more background on this, see HADOOP-13327 [3] and HADOOP-17597 [4].
Also note from those issues that these checks get stricter starting in
Hadoop 3.3.1. Instead of a warning, the application will fail with an
exception to alert users to potential misbehavior. You can opt back in to
the old warning behavior with an additional configuration property.
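
If memory serves, the property from HADOOP-17597 [4] is
fs.s3a.downgrade.syncable.exceptions; a sketch of opting back in, with a
placeholder bucket (this only restores the warning, not real sync
semantics):

    import org.apache.spark.SparkConf

    // Sketch: keep the S3A event log but downgrade the sync failure to a warning.
    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "s3a://some-bucket/spark-history")
      .set("spark.hadoop.fs.s3a.downgrade.syncable.exceptions", "true")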

[1]
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/outputstream.html#org.apache.hadoop.fs.Syncable
[2]
https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/v2.2.8/gcs/CONFIGURATION.md#io-configuration
[3] https://issues.apache.org/jira/browse/HADOOP-13327
[4] https://issues.apache.org/jira/browse/HADOOP-17597

Chris Nauroth


On Wed, Nov 9, 2022 at 11:04 PM second_comet@yahoo.com.INVALID
<se...@yahoo.com.invalid> wrote:

> When running a Spark job, I used
>
>      "spark.eventLog.dir": "s3a://_some_bucket_on_prem/spark-history",
>       "spark.eventLog.enabled": true
>
> I see the job log shows
>
> 22/11/10 06:42:30 INFO SingleEventLogFileWriter: Logging events to
> s3a://_some_bucket_on_prem/spark-history
> /spark-a2befd8cb9134190982a35663b61294b.inprogress
> 22/11/10 06:42:30 WARN S3ABlockOutputStream: Application invoked the
> Syncable API against stream writing to _some_bucket_on_prem/spark-history/a2befd8cb9134190982a35663b61294b.inprogress.
> This is unsupported
>
>
> Does Spark 3.3.0 support writing the log to an s3a bucket? I can't write
> the log. It is an on-premises s3a. Am I missing any jar library? Does it
> support any cloud blob storage providers?
>
>