Posted to user@spark.apache.org by Aniruddha P Tekade <at...@binghamton.edu> on 2019/12/04 01:21:44 UTC

spark writeStream not working with custom S3 endpoint

Hello,

While working with Spark Structured Streaming (v2.4.3), I am trying to write
my streaming DataFrame to a custom S3 endpoint. I have made sure that I can
log in and upload data to the S3 buckets manually through the UI, and I have
also set up the ACCESS_KEY and SECRET_KEY for it.

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.endpoint",
  "s3-region1.myObjectStore.com:443")
sc.hadoopConfiguration.set("fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
sc.hadoopConfiguration.set("fs.s3a.secret.key",
  "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
// bucket name appended as url/bucket and not bucket.url
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

val writeToS3Query = stream.writeStream
      .format("csv")
      .option("sep", ",")
      .option("header", true)
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .option("path", "s3a://bucket0/")
      .option("checkpointLocation", "/Users/home/checkpoints/s3-checkpointing")
      .start()

However, I am getting this error:

Unable to execute HTTP request: bucket0.s3-region1.myObjectStore.com:
nodename nor servname provided, or not known

I have a mapping of the URL to its IP in my /etc/hosts file, and the bucket
is accessible from other sources. Is there any other way to do this
successfully? I am really not sure why the bucket name is being prepended to
the URL when the query is executed by Spark.
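Editor's note, not a definitive answer: the prepending is consistent with the
S3A connector falling back to virtual-hosted-style addressing, which would
happen if fs.s3a.path.style.access is not actually taking effect. A minimal
sketch of the difference (s3Host is a hypothetical helper for illustration,
not the actual S3A client code):

```scala
// Illustration: how the addressing style changes the hostname
// that the client must resolve via DNS (or /etc/hosts).
def s3Host(endpoint: String, bucket: String, pathStyle: Boolean): String =
  if (pathStyle) s"https://$endpoint/$bucket"  // host stays the bare endpoint
  else s"https://$bucket.$endpoint"            // bucket becomes part of the hostname

// Virtual-hosted style produces exactly the name from the error:
println(s3Host("s3-region1.myObjectStore.com:443", "bucket0", pathStyle = false))
// prints https://bucket0.s3-region1.myObjectStore.com:443

// Path style keeps the endpoint that is mapped in /etc/hosts:
println(s3Host("s3-region1.myObjectStore.com:443", "bucket0", pathStyle = true))
// prints https://s3-region1.myObjectStore.com:443/bucket0
```

If that is what is happening, it would also explain the resolution failure:
/etc/hosts does not support wildcards, so an entry for
s3-region1.myObjectStore.com does not cover
bucket0.s3-region1.myObjectStore.com.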

Could this be because I am setting the Hadoop configuration on the Spark
context after the session is created, so the settings are not taking effect?
But then how is it able to resolve the actual URL at all, when the path I am
providing is only s3a://bucket0?
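Editor's note: if the settings are indeed being applied too late, one thing to
try (a sketch, using Spark's standard spark.hadoop.* passthrough, which copies
such properties into the Hadoop configuration when the session is built) is to
supply them before the session exists, e.g. in spark-defaults.conf or via
--conf on spark-submit:

```
spark.hadoop.fs.s3a.endpoint            s3-region1.myObjectStore.com:443
spark.hadoop.fs.s3a.path.style.access   true
spark.hadoop.fs.s3a.access.key          <access key>
spark.hadoop.fs.s3a.secret.key          <secret key>
```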

Best,
Aniruddha