Posted to user@spark.apache.org by Aniruddha P Tekade <at...@binghamton.edu> on 2020/05/02 00:08:04 UTC

Path style access fs.s3a.path.style.access property is not working in spark code

Hello Users,

I am using an on-premise object store and am able to perform operations on
different buckets using the aws-cli.
However, when I try to use the same path from my Spark code, it
fails. Here are the details -

Added dependencies in build.sbt -

   - hadoop-aws-2.7.4.jar
   - aws-java-sdk-1.7.4.jar
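
For reference, these map to build.sbt entries along these lines (a sketch;
hadoop-aws 2.7.x was built against the old monolithic aws-java-sdk 1.7.4):

    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-aws" % "2.7.4",
      "com.amazonaws" % "aws-java-sdk" % "1.7.4"
    )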

Spark Hadoop Configuration setup as -

spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
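
An equivalent way to set these (a sketch, using the same placeholder values)
is to pass them as "spark.hadoop."-prefixed options when building the
session, since Spark copies such options into the Hadoop configuration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
      .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
      .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .getOrCreate()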

And now I try to write data into my custom s3 endpoint as follows -

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{column, current_date, dayofmonth, month, year}
import org.apache.spark.sql.streaming.{DataStreamWriter, StreamingQuery}

val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    column("time"),
    column("quality"),
    column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path", "s3a://test-bucket/")

val streamingQuery: StreamingQuery = dataStreamWriter.start()
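
(As an aside, the Structured Streaming file sink also needs a checkpoint
location before start() will run, e.g. something like

    .option("checkpointLocation", "s3a://test-bucket/checkpoints/")

where the path here is just a placeholder.)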


However, I am getting an error that AmazonHttpClient is unable to execute the
HTTP request, and the bucket name is being prepended to the endpoint
hostname. It seems the Hadoop configuration is not being applied here -


20/05/01 16:51:37 INFO AmazonHttpClient: Unable to execute HTTP request:
test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com
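
The hostname in the exception shows the client is still using
virtual-hosted-style addressing (bucket in the hostname) rather than
path-style (bucket in the path):

    virtual-hosted style: http://test-bucket.s3-region0.cloudian.com/<key>
    path style:           http://s3-region0.cloudian.com/test-bucket/<key>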


Is there anything I am missing in the configuration? Even after setting
fs.s3a.path.style.access to true, it does not seem to take effect.

--
Aniruddha
-----------

Re: Path style access fs.s3a.path.style.access property is not working in spark code

Posted by Samik Raychaudhuri <sa...@gmail.com>.
I recommend using hadoop-aws v2.9.x; it has a lot of optimizations that make
life much easier when accessing object stores from Spark.
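
For example, in build.sbt (a sketch; pair it with the aws-java-sdk-bundle
version that this hadoop-aws release declares in its POM):

    libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.9.2"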
Thanks.
-Samik

-- 
Samik Raychaudhuri, Ph.D.
http://in.linkedin.com/in/samikr/

Re: Path style access fs.s3a.path.style.access property is not working in spark code

Posted by Aniruddha P Tekade <at...@binghamton.edu>.
Hello Users,

I found the solution to this. If you are writing to a custom S3 URL, use
hadoop-aws-2.8.0.jar or later: the separate fs.s3a.path.style.access flag was
only introduced in Hadoop 2.8.0, so with the 2.7.x jars the setting is
silently ignored.
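
For example, in build.sbt (a sketch; aws-java-sdk-s3 is the artifact that
hadoop-aws 2.8.0 depends on - double-check the exact SDK version against its
POM):

    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-aws" % "2.8.0",
      "com.amazonaws" % "aws-java-sdk-s3" % "1.10.6"
    )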

Best,
Aniruddha
-----------

