Posted to user@spark.apache.org by Aniruddha P Tekade <at...@binghamton.edu> on 2020/05/02 00:08:04 UTC
Path style access fs.s3a.path.style.access property is not working in
spark code
Hello Users,
I am using on-premise object storage and am able to perform operations on
different buckets using the aws-cli.
However, when I try to use the same path from my Spark code, it
fails. Here are the details -
Added dependencies in build.sbt -
- hadoop-aws-2.7.4.jar
- aws-java-sdk-1.7.4.jar
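For reference, a build.sbt sketch matching the jars listed above (the Maven coordinates are assumed from the jar filenames):

```scala
// build.sbt sketch -- group/artifact IDs assumed from the jar names above
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"   % "2.7.4",
  "com.amazonaws"     % "aws-java-sdk" % "1.7.4"
)
```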
Spark Hadoop Configuration setup as -
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl",
  "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
And now I try to write data into my custom s3 endpoint as follows -
val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
dayofmonth(current_date()) as "day",
month(current_date()) as "month",
year(current_date()) as "year",
column("time"),
column("quality"),
column("PM25"))
.writeStream
.partitionBy("year", "month", "day")
.format("csv")
.outputMode("append")
.option("path", "s3a://test-bucket/")
val streamingQuery: StreamingQuery = dataStreamWriter.start()
However, I am getting an error that AmazonHttpClient is not able to execute
the HTTP request, and the client is prepending the bucket name to the
endpoint hostname. It seems the hadoop configuration is not being
applied here -
20/05/01 16:51:37 INFO AmazonHttpClient: Unable to execute HTTP request:
test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com
Is there anything I am missing in the configuration? Even after setting
path style access to true, it does not seem to work.
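The hostname in that exception shows what is happening: without path-style access, the client builds a virtual-hosted URL and puts the bucket into the DNS name. A minimal sketch of the two addressing forms (bucket and endpoint names taken from the log above):

```scala
// How an S3 client forms the request URL for bucket "test-bucket" against
// endpoint "s3-region0.cloudian.com" (both names taken from the log above).

// Virtual-hosted style (the default): the bucket becomes part of the
// hostname, so DNS must resolve "<bucket>.<endpoint>" -- the exact lookup
// that fails with UnknownHostException above.
def virtualHostedUrl(bucket: String, endpoint: String): String =
  s"https://$bucket.$endpoint/"

// Path style: the bucket moves into the URL path, so only the endpoint
// itself needs a DNS entry -- what fs.s3a.path.style.access=true requests.
def pathStyleUrl(bucket: String, endpoint: String): String =
  s"https://$endpoint/$bucket/"

println(virtualHostedUrl("test-bucket", "s3-region0.cloudian.com"))
println(pathStyleUrl("test-bucket", "s3-region0.cloudian.com"))
```

With hadoop-aws 2.7.4 the path-style flag did not exist yet, so the client always builds the first form regardless of the setting.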
--
Aniruddha
-----------
Re: Path style access fs.s3a.path.style.access property is not
working in spark code
Posted by Samik Raychaudhuri <sa...@gmail.com>.
I recommend using hadoop-aws v2.9.x; it has a lot of optimizations that
make life much easier when accessing object stores from Spark.
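A sketch of the corresponding build.sbt change (2.9.2 is just an example 2.9.x version):

```scala
// build.sbt sketch -- hadoop-aws 2.9.x pulls in a matching aws-java-sdk-bundle
// transitively, so the separate aws-java-sdk 1.7.4 jar can be dropped
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.9.2"
```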
Thanks.
-Samik
On 05-05-2020 01:55 am, Aniruddha P Tekade wrote:
> Hello User,
>
> I found the solution to this. If you are writing to a custom S3 URL,
> use hadoop-aws-2.8.0.jar, since the separate flag to enable path
> style access was only introduced in Hadoop 2.8.0.
>
> Best,
> Aniruddha
> -----------
>
--
Samik Raychaudhuri, Ph.D.
http://in.linkedin.com/in/samikr/
Re: Path style access fs.s3a.path.style.access property is not
working in spark code
Posted by Aniruddha P Tekade <at...@binghamton.edu>.
Hello User,
I found the solution to this. If you are writing to a custom S3 URL, use
hadoop-aws-2.8.0.jar, since the separate flag (fs.s3a.path.style.access) to
enable path style access was only introduced in Hadoop 2.8.0.
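A related note: these fs.s3a.* options can also be supplied with the spark.hadoop.* prefix (for example in spark-defaults.conf) instead of being set on the SparkContext, so they reach every Hadoop filesystem instance. A sketch with placeholder values:

```
# spark-defaults.conf sketch -- endpoint and credentials are placeholders
spark.hadoop.fs.s3a.endpoint              s3-region0.cloudian.com
spark.hadoop.fs.s3a.path.style.access     true
spark.hadoop.fs.s3a.access.key            ACCESS_KEY
spark.hadoop.fs.s3a.secret.key            SECRET_KEY
```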
Best,
Aniruddha
-----------