Posted to user@spark.apache.org by Scott Reynolds <sr...@twilio.com> on 2015/10/15 20:04:34 UTC

s3a file system and spark deployment mode

List,

Right now we build our spark jobs with the s3a hadoop client. We do this
because our machines are only allowed to use IAM access to the s3 store. We
can build our jars with the s3a filesystem and the aws sdk just fine, and
these jars run great in *client mode*.

We would like to move from client mode to cluster mode as that will allow
us to be more resilient to driver failure. In order to do this either:
1. the jar file has to be on worker's local disk
2. the jar file is in shared storage (s3a)

We would like to put the jar file in s3 storage, but when we give the jar
path as s3a://......, the worker node doesn't have the hadoop s3a and aws
sdk in its classpath / uber jar.

Other than building spark with those two dependencies, what other options
do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a thing.

Need to get s3a access to both the master (so that we can log spark event
log to s3) and to the worker processes (driver, executor).

Looking for ideas before just adding the dependencies to our spark build
and calling it a day.
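
For the sake of illustration, the submit we are aiming for looks roughly like
this (the master host, bucket, jar, and class names below are placeholders):

    spark-submit \
      --master spark://our-master-host:7077 \
      --deploy-mode cluster \
      --class com.example.OurJob \
      s3a://our-bucket/jars/our-job-assembly.jar

In cluster mode the driver is launched on a worker, which then has to fetch
that jar from s3, hence the classpath problem above.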

Re: s3a file system and spark deployment mode

Posted by Raghavendra Pandey <ra...@gmail.com>.
You can add the classpath info in the hadoop env file...

Add the following line to your $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*

Add the following line to $SPARK_HOME/conf/spark-env.sh:

    export SPARK_DIST_CLASSPATH=$($HADOOP_HOME/bin/hadoop --config $HADOOP_HOME/etc/hadoop classpath)


This is how you set up hadoop 2.7.1 with the "without hadoop" build of spark
1.5.1. This will also put the jars needed for s3a access on your classpath.
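
To double-check that the aws jars are actually in that tools dir (a quick
sanity check; the path below is just the stock Hadoop 2.7.1 binary layout):

    ls $HADOOP_HOME/share/hadoop/tools/lib/ | grep -iE 'hadoop-aws|aws-java-sdk'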

Also, please note that you need to set the fs.s3a.access.key
and fs.s3a.secret.key properties in your core-site.xml, rather
than fs.s3a.awsSecretAccessKey and fs.s3a.awsAccessKeyId as mentioned in
the docs.
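
For illustration, the core-site.xml entries would look something like this
(the values are placeholders; if you rely purely on IAM instance credentials
you may be able to omit them):

    <property>
      <name>fs.s3a.access.key</name>
      <value>YOUR_ACCESS_KEY</value>
    </property>
    <property>
      <name>fs.s3a.secret.key</name>
      <value>YOUR_SECRET_KEY</value>
    </property>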

Good luck
-Raghav

On Fri, Oct 16, 2015 at 9:07 PM, Scott Reynolds <sr...@twilio.com>
wrote:

> hmm I tried using --jars and that got passed to MasterArguments and that
> doesn't work :-(
>
>
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala
>
> Same with Worker:
> https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala
>
> Both Master and Worker have to start with these two jars because
> a.) the Master has to serve the event log in s3
> b.) the Worker runs the Driver and has to download the jar from s3
>
> And yes I am using these deps:
>
> <!--
>                 Though spark uses 2.6.0, we need the latest because of
> this:
>
> http://stackoverflow.com/questions/32230039/apache-spark-hangs-after-writing-parquet-file-to-s3-bucket
>             -->
>             <dependency>
>                 <groupId>org.apache.hadoop</groupId>
>                 <artifactId>hadoop-aws</artifactId>
>                 <version>2.7.1</version>
>             </dependency>
>
>             <dependency>
>                 <groupId>com.amazonaws</groupId>
>                 <artifactId>aws-java-sdk</artifactId>
>                 <version>1.7.4</version>
>             </dependency>
>
> I think I have settled on just modifying the java command line that starts
> up the worker and master. Just seems easier. Currently launching them with
> the spark-class bash script:
>
> /mnt/services/spark/bin/spark-class org.apache.spark.deploy.master.Master \
>     --ip `hostname -i` --port 7077 --webui-port 8080
>
> If all else fails I will update the spark pom and include it in the
> shaded spark jar.
>
> On Fri, Oct 16, 2015 at 2:25 AM, Steve Loughran <st...@hortonworks.com>
> wrote:
>
>>
>> > On 15 Oct 2015, at 19:04, Scott Reynolds <sr...@twilio.com> wrote:
>> >
>> > List,
>> >
>> > Right now we build our spark jobs with the s3a hadoop client. We do
>> this because our machines are only allowed to use IAM access to the s3
>> store. We can build our jars with the s3a filesystem and the aws sdk just
>> fine, and these jars run great in *client mode*.
>> >
>> > We would like to move from client mode to cluster mode as that will
>> allow us to be more resilient to driver failure. In order to do this either:
>> > 1. the jar file has to be on worker's local disk
>> > 2. the jar file is in shared storage (s3a)
>> >
>> > We would like to put the jar file in s3 storage, but when we give the
>> jar path as s3a://......, the worker node doesn't have the hadoop s3a and
>> aws sdk in its classpath / uber jar.
>> >
>> > Other than building spark with those two dependencies, what other
>> options do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a
>> thing.
>> >
>> > Need to get s3a access to both the master (so that we can log spark
>> event log to s3) and to the worker processes (driver, executor).
>> >
>> > Looking for ideas before just adding the dependencies to our spark
>> build and calling it a day.
>>
>>
>> you can use --jars to add these, e.g.
>>
>> --jars hadoop-aws.jar,aws-java-sdk-s3
>>
>>
>> as others have warned, you need Hadoop 2.7.1 for s3a to work properly
>>
>
>

Re: s3a file system and spark deployment mode

Posted by Scott Reynolds <sr...@twilio.com>.
hmm I tried using --jars and that got passed to MasterArguments and that
doesn't work :-(

https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala

Same with Worker:
https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala

Both Master and Worker have to start with these two jars because
a.) the Master has to serve the event log in s3
b.) the Worker runs the Driver and has to download the jar from s3

And yes I am using these deps:

<!--
                Though spark uses 2.6.0, we need the latest because of this:

http://stackoverflow.com/questions/32230039/apache-spark-hangs-after-writing-parquet-file-to-s3-bucket
            -->
            <dependency>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-aws</artifactId>
                <version>2.7.1</version>
            </dependency>

            <dependency>
                <groupId>com.amazonaws</groupId>
                <artifactId>aws-java-sdk</artifactId>
                <version>1.7.4</version>
            </dependency>

I think I have settled on just modifying the java command line that starts
up the worker and master. Just seems easier. Currently launching them with
the spark-class bash script:

/mnt/services/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --ip `hostname -i` --port 7077 --webui-port 8080
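
One variant of that (just a sketch; the extra/ directory and the exact jar
file names are whatever we end up staging on the hosts) is to leave
spark-class alone and export the two jars through SPARK_DIST_CLASSPATH first,
which spark-class should pick up on the daemon classpath:

    export SPARK_DIST_CLASSPATH="/mnt/services/spark/extra/hadoop-aws-2.7.1.jar:/mnt/services/spark/extra/aws-java-sdk-1.7.4.jar:$SPARK_DIST_CLASSPATH"
    /mnt/services/spark/bin/spark-class org.apache.spark.deploy.master.Master \
        --ip `hostname -i` --port 7077 --webui-port 8080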

If all else fails I will update the spark pom and include it in the
shaded spark jar.

On Fri, Oct 16, 2015 at 2:25 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>
> > On 15 Oct 2015, at 19:04, Scott Reynolds <sr...@twilio.com> wrote:
> >
> > List,
> >
> > Right now we build our spark jobs with the s3a hadoop client. We do this
> because our machines are only allowed to use IAM access to the s3 store. We
> can build our jars with the s3a filesystem and the aws sdk just fine, and
> these jars run great in *client mode*.
> >
> > We would like to move from client mode to cluster mode as that will
> allow us to be more resilient to driver failure. In order to do this either:
> > 1. the jar file has to be on worker's local disk
> > 2. the jar file is in shared storage (s3a)
> >
> > We would like to put the jar file in s3 storage, but when we give the
> jar path as s3a://......, the worker node doesn't have the hadoop s3a and
> aws sdk in its classpath / uber jar.
> >
> > Other than building spark with those two dependencies, what other
> options do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a
> thing.
> >
> > Need to get s3a access to both the master (so that we can log spark
> event log to s3) and to the worker processes (driver, executor).
> >
> > Looking for ideas before just adding the dependencies to our spark build
> and calling it a day.
>
>
> you can use --jars to add these, e.g.
>
> --jars hadoop-aws.jar,aws-java-sdk-s3
>
>
> as others have warned, you need Hadoop 2.7.1 for s3a to work properly
>

Re: s3a file system and spark deployment mode

Posted by Steve Loughran <st...@hortonworks.com>.
> On 15 Oct 2015, at 19:04, Scott Reynolds <sr...@twilio.com> wrote:
> 
> List,
> 
> Right now we build our spark jobs with the s3a hadoop client. We do this because our machines are only allowed to use IAM access to the s3 store. We can build our jars with the s3a filesystem and the aws sdk just fine, and these jars run great in *client mode*.
> 
> We would like to move from client mode to cluster mode as that will allow us to be more resilient to driver failure. In order to do this either:
> 1. the jar file has to be on worker's local disk
> 2. the jar file is in shared storage (s3a)
> 
> We would like to put the jar file in s3 storage, but when we give the jar path as s3a://......, the worker node doesn't have the hadoop s3a and aws sdk in its classpath / uber jar.
> 
> Other than building spark with those two dependencies, what other options do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a thing.
> 
> Need to get s3a access to both the master (so that we can log spark event log to s3) and to the worker processes (driver, executor).
> 
> Looking for ideas before just adding the dependencies to our spark build and calling it a day.


you can use --jars to add these, e.g.

--jars hadoop-aws.jar,aws-java-sdk-s3


as others have warned, you need Hadoop 2.7.1 for s3a to work properly
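
For example, something along these lines (the jar locations and the
application jar are placeholders; point --jars at wherever the Hadoop 2.7.1
hadoop-aws and matching aws-java-sdk jars live on your machines):

    spark-submit \
      --deploy-mode cluster \
      --jars /opt/hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.1.jar,/opt/hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar \
      --class com.example.OurJob \
      s3a://our-bucket/jars/our-job-assembly.jar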



Re: s3a file system and spark deployment mode

Posted by Raghavendra Pandey <ra...@gmail.com>.
You can use the "without hadoop" build of spark 1.5.1 together with hadoop 2.7.1.
Hadoop 2.7.1 is more mature for s3a access. You also need to add the hadoop
tools dir to the hadoop classpath...

Raghav
On Oct 16, 2015 1:09 AM, "Scott Reynolds" <sr...@twilio.com> wrote:

> We do not use EMR. This is deployed on Amazon VMs
>
> We build Spark with Hadoop-2.6.0 but that does not include the s3a
> filesystem nor the Amazon AWS SDK
>
> On Thu, Oct 15, 2015 at 12:26 PM, Spark Newbie <sp...@gmail.com>
> wrote:
>
>> Are you using EMR?
>> You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster.
>> And that brings the s3a jars to the worker nodes and they become available to
>> your application.
>>
>> On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds <sr...@twilio.com>
>> wrote:
>>
>>> List,
>>>
>>> Right now we build our spark jobs with the s3a hadoop client. We do this
>>> because our machines are only allowed to use IAM access to the s3 store. We
>>> can build our jars with the s3a filesystem and the aws sdk just fine, and
>>> these jars run great in *client mode*.
>>>
>>> We would like to move from client mode to cluster mode as that will
>>> allow us to be more resilient to driver failure. In order to do this either:
>>> 1. the jar file has to be on worker's local disk
>>> 2. the jar file is in shared storage (s3a)
>>>
>>> We would like to put the jar file in s3 storage, but when we give the
>>> jar path as s3a://......, the worker node doesn't have the hadoop s3a and
>>> aws sdk in its classpath / uber jar.
>>>
>>> Other than building spark with those two dependencies, what other
>>> options do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a
>>> thing.
>>>
>>> Need to get s3a access to both the master (so that we can log spark
>>> event log to s3) and to the worker processes (driver, executor).
>>>
>>> Looking for ideas before just adding the dependencies to our spark build
>>> and calling it a day.
>>>
>>
>>
>

Re: s3a file system and spark deployment mode

Posted by Scott Reynolds <sr...@twilio.com>.
We do not use EMR. This is deployed on Amazon VMs

We build Spark with Hadoop-2.6.0 but that does not include the s3a
filesystem nor the Amazon AWS SDK

On Thu, Oct 15, 2015 at 12:26 PM, Spark Newbie <sp...@gmail.com>
wrote:

> Are you using EMR?
> You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster.
> And that brings the s3a jars to the worker nodes and they become available to
> your application.
>
> On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds <sr...@twilio.com>
> wrote:
>
>> List,
>>
>> Right now we build our spark jobs with the s3a hadoop client. We do this
>> because our machines are only allowed to use IAM access to the s3 store. We
> can build our jars with the s3a filesystem and the aws sdk just fine, and
> these jars run great in *client mode*.
>>
>> We would like to move from client mode to cluster mode as that will allow
>> us to be more resilient to driver failure. In order to do this either:
>> 1. the jar file has to be on worker's local disk
>> 2. the jar file is in shared storage (s3a)
>>
>> We would like to put the jar file in s3 storage, but when we give the jar
>> path as s3a://......, the worker node doesn't have the hadoop s3a and aws
>> sdk in its classpath / uber jar.
>>
> Other than building spark with those two dependencies, what other options
> do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a thing.
>>
>> Need to get s3a access to both the master (so that we can log spark event
>> log to s3) and to the worker processes (driver, executor).
>>
>> Looking for ideas before just adding the dependencies to our spark build
>> and calling it a day.
>>
>
>

Re: s3a file system and spark deployment mode

Posted by Spark Newbie <sp...@gmail.com>.
Are you using EMR?
You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster.
And that brings the s3a jars to the worker nodes and they become available to
your application.

On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds <sr...@twilio.com>
wrote:

> List,
>
> Right now we build our spark jobs with the s3a hadoop client. We do this
> because our machines are only allowed to use IAM access to the s3 store. We
> can build our jars with the s3a filesystem and the aws sdk just fine, and
> these jars run great in *client mode*.
>
> We would like to move from client mode to cluster mode as that will allow
> us to be more resilient to driver failure. In order to do this either:
> 1. the jar file has to be on worker's local disk
> 2. the jar file is in shared storage (s3a)
>
> We would like to put the jar file in s3 storage, but when we give the jar
> path as s3a://......, the worker node doesn't have the hadoop s3a and aws
> sdk in its classpath / uber jar.
>
> Other than building spark with those two dependencies, what other options
> do I have? We are using 1.5.1 so SPARK_CLASSPATH is no longer a thing.
>
> Need to get s3a access to both the master (so that we can log spark event
> log to s3) and to the worker processes (driver, executor).
>
> Looking for ideas before just adding the dependencies to our spark build
> and calling it a day.
>