Posted to user@spark.apache.org by Aureliano Buendia <bu...@gmail.com> on 2014/04/16 19:59:29 UTC

Using google cloud storage for spark big data

Hi,

Google has published a new connector for hadoop: google cloud storage,
which is an equivalent of amazon s3:

googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

How can spark be configured to use this connector?

Re: Using google cloud storage for spark big data

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Hi Aureliano,

You might want to check this script out,
https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help around that.

Thanks
Best Regards


On Tue, Apr 22, 2014 at 7:12 PM, Aureliano Buendia <bu...@gmail.com> wrote:

>
>
>
> On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth <
> andras.nemeth@lynxanalytics.com> wrote:
>
>> We don't have anything fancy. It's basically some very thin layer of
>> google specifics on top of a stand alone cluster. We basically created two
>> disk snapshots, one for the master and one for the workers. The snapshots
>> contain initialization scripts so that the master/worker daemons are
>> started on boot. So if I want a cluster I just create a new instance (with
>> a fixed name) using the master snapshot for the master. When it is up I
>> start as many slave instances as I need using the slave snapshot. By the
>> time the machines are up the cluster is ready to be used.
>>
>>
> This sounds a lot simpler than the existing spark-ec2 script. Does the
> google compute engine api make this easier than the ec2 api does? Does
> your script do everything spark-ec2 does?
>
> Also, any plans to make this open source?
>
>
>> Andras
>>
>>
>>
>> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi <ma...@gmail.com> wrote:
>>
>>> Okay just commented on another thread :)
>>> I have one that I use internally. Can give it out but will need some
>>> support from you to fix bugs etc. Let me know if you are interested.
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <buendia360@gmail.com
>>> > wrote:
>>>
>>>> Thanks, Andras. What approach did you use to setup a spark cluster on
>>>> google compute engine? Currently, there is no production-ready official
>>>> support for an equivalent of spark-ec2 on gce. Did you roll your own?
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
>>>> andras.nemeth@lynxanalytics.com> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <
>>>>> buendia360@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Google has published a new connector for hadoop: google cloud
>>>>>> storage, which is an equivalent of amazon s3:
>>>>>>
>>>>>>
>>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>>>
>>>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>>> connector for a while already, also linked from your article:
>>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> How can spark be configured to use this connector?
>>>>>>
>>>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>>>> reason Google does not officially publish the library jar on its own;
>>>>> you get it installed as part of a Hadoop on Google Cloud installation.
>>>>> So the official way (which we did not try) would be to have a Hadoop
>>>>> on Google Cloud installation and run spark on top of that.
>>>>>
>>>>> The other option - which we did try and which works fine for us - is
>>>>> to snatch the jar
>>>>> (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
>>>>> and make sure it's shipped to your workers (e.g. with setJars on
>>>>> SparkConf when you create your SparkContext). Then create a
>>>>> core-site.xml file and make sure it is on the classpath both in your
>>>>> driver and on your cluster (e.g. by making sure it ends up in one of
>>>>> the jars you send with setJars above), with this content (with YOUR_*
>>>>> replaced):
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>fs.gs.impl</name>
>>>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fs.gs.project.id</name>
>>>>>     <value>YOUR_PROJECT_ID</value>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>fs.gs.system.bucket</name>
>>>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>>>   </property>
>>>>> </configuration>
>>>>>
>>>>> From this point on you can simply use gs://... filenames to read/write
>>>>> data on Cloud Storage.
>>>>>
>>>>> Note that you should run your cluster and driver program on Google
>>>>> Compute Engine for this to work as is. Probably it's possible to configure
>>>>> access from the outside too but we didn't do that.
>>>>>
>>>>> Hope this helps,
>>>>> Andras
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Using google cloud storage for spark big data

Posted by Aureliano Buendia <bu...@gmail.com>.
On Tue, Apr 22, 2014 at 10:50 AM, Andras Nemeth <
andras.nemeth@lynxanalytics.com> wrote:

> We don't have anything fancy. It's basically some very thin layer of
> google specifics on top of a stand alone cluster. We basically created two
> disk snapshots, one for the master and one for the workers. The snapshots
> contain initialization scripts so that the master/worker daemons are
> started on boot. So if I want a cluster I just create a new instance (with
> a fixed name) using the master snapshot for the master. When it is up I
> start as many slave instances as I need using the slave snapshot. By the
> time the machines are up the cluster is ready to be used.
>
>
This sounds a lot simpler than the existing spark-ec2 script. Does the
google compute engine api make this easier than the ec2 api does? Does
your script do everything spark-ec2 does?

Also, any plans to make this open source?


> Andras
>
>
>
> On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi <ma...@gmail.com> wrote:
>
>> Okay just commented on another thread :)
>> I have one that I use internally. Can give it out but will need some
>> support from you to fix bugs etc. Let me know if you are interested.
>>
>> Mayur Rustagi
>> Ph: +1 (760) 203 3257
>> http://www.sigmoidanalytics.com
>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>
>>
>>
>> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>>
>>> Thanks, Andras. What approach did you use to setup a spark cluster on
>>> google compute engine? Currently, there is no production-ready official
>>> support for an equivalent of spark-ec2 on gce. Did you roll your own?
>>>
>>>
>>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
>>> andras.nemeth@lynxanalytics.com> wrote:
>>>
>>>> Hello!
>>>>
>>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <
>>>> buendia360@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Google has published a new connector for hadoop: google cloud
>>>>> storage, which is an equivalent of amazon s3:
>>>>>
>>>>>
>>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>>
>>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>>> connector for a while already, also linked from your article:
>>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>>
>>>>
>>>>>
>>>>>
>>>>> How can spark be configured to use this connector?
>>>>>
>>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>>> reason Google does not officially publish the library jar on its own;
>>>> you get it installed as part of a Hadoop on Google Cloud installation.
>>>> So the official way (which we did not try) would be to have a Hadoop
>>>> on Google Cloud installation and run spark on top of that.
>>>>
>>>> The other option - which we did try and which works fine for us - is
>>>> to snatch the jar
>>>> (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
>>>> and make sure it's shipped to your workers (e.g. with setJars on
>>>> SparkConf when you create your SparkContext). Then create a
>>>> core-site.xml file and make sure it is on the classpath both in your
>>>> driver and on your cluster (e.g. by making sure it ends up in one of
>>>> the jars you send with setJars above), with this content (with YOUR_*
>>>> replaced):
>>>> <configuration>
>>>>   <property>
>>>>     <name>fs.gs.impl</name>
>>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.project.id</name>
>>>>     <value>YOUR_PROJECT_ID</value>
>>>>   </property>
>>>>   <property>
>>>>     <name>fs.gs.system.bucket</name>
>>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>>   </property>
>>>> </configuration>
>>>>
>>>> From this point on you can simply use gs://... filenames to read/write
>>>> data on Cloud Storage.
>>>>
>>>> Note that you should run your cluster and driver program on Google
>>>> Compute Engine for this to work as is. Probably it's possible to configure
>>>> access from the outside too but we didn't do that.
>>>>
>>>> Hope this helps,
>>>> Andras
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: Using google cloud storage for spark big data

Posted by Andras Nemeth <an...@lynxanalytics.com>.
We don't have anything fancy. It's basically some very thin layer of google
specifics on top of a stand alone cluster. We basically created two disk
snapshots, one for the master and one for the workers. The snapshots
contain initialization scripts so that the master/worker daemons are
started on boot. So if I want a cluster I just create a new instance (with
a fixed name) using the master snapshot for the master. When it is up I
start as many slave instances as I need using the slave snapshot. By the
time the machines are up the cluster is ready to be used.

Andras



On Mon, Apr 21, 2014 at 10:04 PM, Mayur Rustagi <ma...@gmail.com> wrote:

> Okay just commented on another thread :)
> I have one that I use internally. Can give it out but will need some
> support from you to fix bugs etc. Let me know if you are interested.
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>
>> Thanks, Andras. What approach did you use to setup a spark cluster on
>> google compute engine? Currently, there is no production-ready official
>> support for an equivalent of spark-ec2 on gce. Did you roll your own?
>>
>>
>> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
>> andras.nemeth@lynxanalytics.com> wrote:
>>
>>> Hello!
>>>
>>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <buendia360@gmail.com
>>> > wrote:
>>>
>>>> Hi,
>>>>
>>>> Google has published a new connector for hadoop: google cloud storage,
>>>> which is an equivalent of amazon s3:
>>>>
>>>>
>>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>>
>>> This is actually about Cloud Datastore and not Cloud Storage (yeah,
>>> quite confusing naming ;) ). But they have had a Cloud Storage
>>> connector for a while already, also linked from your article:
>>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>>
>>>
>>>>
>>>>
>>>> How can spark be configured to use this connector?
>>>>
>>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>>> reason Google does not officially publish the library jar on its own;
>>> you get it installed as part of a Hadoop on Google Cloud installation.
>>> So the official way (which we did not try) would be to have a Hadoop
>>> on Google Cloud installation and run spark on top of that.
>>>
>>> The other option - which we did try and which works fine for us - is
>>> to snatch the jar
>>> (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
>>> and make sure it's shipped to your workers (e.g. with setJars on
>>> SparkConf when you create your SparkContext). Then create a
>>> core-site.xml file and make sure it is on the classpath both in your
>>> driver and on your cluster (e.g. by making sure it ends up in one of
>>> the jars you send with setJars above), with this content (with YOUR_*
>>> replaced):
>>> <configuration>
>>>   <property>
>>>     <name>fs.gs.impl</name>
>>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>>   </property>
>>>   <property>
>>>     <name>fs.gs.project.id</name>
>>>     <value>YOUR_PROJECT_ID</value>
>>>   </property>
>>>   <property>
>>>     <name>fs.gs.system.bucket</name>
>>>     <value>YOUR_FAVORITE_BUCKET</value>
>>>   </property>
>>> </configuration>
>>>
>>> From this point on you can simply use gs://... filenames to read/write
>>> data on Cloud Storage.
>>>
>>> Note that you should run your cluster and driver program on Google
>>> Compute Engine for this to work as is. Probably it's possible to configure
>>> access from the outside too but we didn't do that.
>>>
>>> Hope this helps,
>>> Andras
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Using google cloud storage for spark big data

Posted by Mayur Rustagi <ma...@gmail.com>.
Okay just commented on another thread :)
I have one that I use internally. Can give it out but will need some
support from you to fix bugs etc. Let me know if you are interested.

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Fri, Apr 18, 2014 at 9:08 PM, Aureliano Buendia <bu...@gmail.com> wrote:

> Thanks, Andras. What approach did you use to setup a spark cluster on
> google compute engine? Currently, there is no production-ready official
> support for an equivalent of spark-ec2 on gce. Did you roll your own?
>
>
> On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
> andras.nemeth@lynxanalytics.com> wrote:
>
>> Hello!
>>
>> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Google has published a new connector for hadoop: google cloud storage,
>>> which is an equivalent of amazon s3:
>>>
>>>
>>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>>
>> This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
>> confusing naming ;) ). But they have had a Cloud Storage connector for a
>> while already, also linked from your article:
>> https://developers.google.com/hadoop/google-cloud-storage-connector
>>
>>
>>>
>>>
>>> How can spark be configured to use this connector?
>>>
>> Yes, it can, but in a somewhat hacky way. The problem is that for some
>> reason Google does not officially publish the library jar on its own;
>> you get it installed as part of a Hadoop on Google Cloud installation.
>> So the official way (which we did not try) would be to have a Hadoop
>> on Google Cloud installation and run spark on top of that.
>>
>> The other option - which we did try and which works fine for us - is
>> to snatch the jar
>> (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
>> and make sure it's shipped to your workers (e.g. with setJars on
>> SparkConf when you create your SparkContext). Then create a
>> core-site.xml file and make sure it is on the classpath both in your
>> driver and on your cluster (e.g. by making sure it ends up in one of
>> the jars you send with setJars above), with this content (with YOUR_*
>> replaced):
>> <configuration>
>>   <property>
>>     <name>fs.gs.impl</name>
>>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>>   </property>
>>   <property>
>>     <name>fs.gs.project.id</name>
>>     <value>YOUR_PROJECT_ID</value>
>>   </property>
>>   <property>
>>     <name>fs.gs.system.bucket</name>
>>     <value>YOUR_FAVORITE_BUCKET</value>
>>   </property>
>> </configuration>
>>
>> From this point on you can simply use gs://... filenames to read/write
>> data on Cloud Storage.
>>
>> Note that you should run your cluster and driver program on Google
>> Compute Engine for this to work as is. Probably it's possible to configure
>> access from the outside too but we didn't do that.
>>
>> Hope this helps,
>> Andras
>>
>>
>>
>>
>>
>

Re: Using google cloud storage for spark big data

Posted by Aureliano Buendia <bu...@gmail.com>.
Thanks, Andras. What approach did you use to setup a spark cluster on
google compute engine? Currently, there is no production-ready official
support for an equivalent of spark-ec2 on gce. Did you roll your own?


On Thu, Apr 17, 2014 at 10:24 AM, Andras Nemeth <
andras.nemeth@lynxanalytics.com> wrote:

> Hello!
>
> On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <bu...@gmail.com> wrote:
>
>> Hi,
>>
>> Google has published a new connector for hadoop: google cloud storage,
>> which is an equivalent of amazon s3:
>>
>>
>> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>>
> This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
> confusing naming ;) ). But they have had a Cloud Storage connector for a
> while already, also linked from your article:
> https://developers.google.com/hadoop/google-cloud-storage-connector
>
>
>>
>>
>> How can spark be configured to use this connector?
>>
> Yes, it can, but in a somewhat hacky way. The problem is that for some
> reason Google does not officially publish the library jar on its own;
> you get it installed as part of a Hadoop on Google Cloud installation.
> So the official way (which we did not try) would be to have a Hadoop
> on Google Cloud installation and run spark on top of that.
>
> The other option - which we did try and which works fine for us - is
> to snatch the jar
> (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar)
> and make sure it's shipped to your workers (e.g. with setJars on
> SparkConf when you create your SparkContext). Then create a
> core-site.xml file and make sure it is on the classpath both in your
> driver and on your cluster (e.g. by making sure it ends up in one of
> the jars you send with setJars above), with this content (with YOUR_*
> replaced):
> <configuration>
>   <property>
>     <name>fs.gs.impl</name>
>     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
>   </property>
>   <property>
>     <name>fs.gs.project.id</name>
>     <value>YOUR_PROJECT_ID</value>
>   </property>
>   <property>
>     <name>fs.gs.system.bucket</name>
>     <value>YOUR_FAVORITE_BUCKET</value>
>   </property>
> </configuration>
>
> From this point on you can simply use gs://... filenames to read/write
> data on Cloud Storage.
>
> Note that you should run your cluster and driver program on Google Compute
> Engine for this to work as is. Probably it's possible to configure access
> from the outside too but we didn't do that.
>
> Hope this helps,
> Andras
>
>
>
>
>

Re: Using google cloud storage for spark big data

Posted by Andras Nemeth <an...@lynxanalytics.com>.
Hello!

On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <bu...@gmail.com> wrote:

> Hi,
>
> Google has published a new connector for hadoop: google cloud storage,
> which is an equivalent of amazon s3:
>
>
> googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html
>
This is actually about Cloud Datastore and not Cloud Storage (yeah, quite
confusing naming ;) ). But they have had a Cloud Storage connector for a
while already, also linked from your article:
https://developers.google.com/hadoop/google-cloud-storage-connector


>
>
> How can spark be configured to use this connector?
>
Yes, it can, but in a somewhat hacky way. The problem is that for some
reason Google does not officially publish the library jar on its own; you
get it installed as part of a Hadoop on Google Cloud installation. So the
official way (which we did not try) would be to have a Hadoop on Google
Cloud installation and run spark on top of that.

The other option - which we did try and which works fine for us - is to
snatch the jar
(https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar) and
make sure it's shipped to your workers (e.g. with setJars on SparkConf when
you create your SparkContext). Then create a core-site.xml file and make
sure it is on the classpath both in your driver and on your cluster (e.g.
by making sure it ends up in one of the jars you send with setJars above),
with this content (with YOUR_* replaced):
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>fs.gs.system.bucket</name>
    <value>YOUR_FAVORITE_BUCKET</value>
  </property>
</configuration>

From this point on you can simply use gs://... filenames to read/write data
on Cloud Storage.
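
For concreteness, a minimal driver sketch of the above in Scala might look
like the following. The master URL, jar paths and bucket name are
placeholders, the word count is just an arbitrary job to exercise gs://
reads and writes, and it assumes core-site.xml is packaged at the root of
the application jar as described above.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD operations such as reduceByKey

object GcsWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("gcs-word-count")
      .setMaster("spark://YOUR_MASTER_HOST:7077") // standalone master, placeholder
      .setJars(Seq(
        "/path/to/gcs-connector-1.2.4.jar", // the connector jar snatched above
        "/path/to/your-app.jar"             // your app jar, carrying core-site.xml
      ))
    val sc = new SparkContext(conf)

    // With fs.gs.impl resolved from core-site.xml, gs:// paths behave like
    // any other Hadoop filesystem URI.
    val lines  = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input/*.txt")
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1L))
                      .reduceByKey(_ + _)
    counts.saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output/word-counts")

    sc.stop()
  }
}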

Note that you should run your cluster and driver program on Google Compute
Engine for this to work as is. Probably it's possible to configure access
from the outside too but we didn't do that.

Hope this helps,
Andras