Posted to user@spark.apache.org by Benjamin Kim <bb...@gmail.com> on 2017/02/12 04:28:58 UTC

Remove dependence on HDFS

Has anyone got advice on how to remove the reliance on HDFS for storing persistent data? We have an on-premises Spark cluster, and it seems like a waste of resources to keep adding nodes solely for lack of storage space. I would rather add more powerful nodes less frequently, to address the lack of processing power, than add less powerful nodes more frequently just to handle the ever-growing data. Can anyone point me in the right direction? Is Alluxio a good solution? S3? I would like to hear your thoughts.

Cheers,
Ben 


Re: Remove dependence on HDFS

Posted by Jörn Franke <jo...@gmail.com>.
You have to carefully assess whether your strategy makes sense given your users' workloads; as it stands, I am not sure your reasoning holds.

However, you can, for example, install OpenStack Swift as an object store and use that as your storage layer. HDFS in this case can be used as a temporary store and/or for checkpointing. Alternatively, you can do this fully in memory with Ignite or Alluxio.

S3 is the cloud storage provided by Amazon, so it is not on-premises. You can take the same approach as described above, but using S3 instead of Swift.
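A minimal sketch of that split, e.g. in spark-shell on Spark 2.1+ (the
namenode address, the Swift container/service names, the column and all
paths are placeholders of mine; the swift:// scheme also assumes the
hadoop-openstack connector is on the classpath with its
fs.swift.service.* credentials configured):

// Transient state (checkpoints) stays on a small HDFS and can be
// purged once jobs complete.
spark.sparkContext.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints")

// Durable data lives in the object store.
val df = spark.read.parquet("swift://events.myprovider/raw/2017/02/")

// checkpoint() truncates lineage and materializes under the HDFS dir.
val counts = df.groupBy("userId").count().checkpoint()

counts.write.parquet("swift://events.myprovider/aggregates/user_counts/")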


Re: Remove dependence on HDFS

Posted by Calvin Jia <ji...@gmail.com>.
Hi Ben,

You can replace HDFS with a number of storage systems, since Spark is
compatible with other storage backends like S3. This would let you scale
your compute nodes solely to add compute power, not disk space. You can
deploy Alluxio on your compute nodes to offset the performance impact of
decoupling compute and storage, as well as to unify multiple storage
spaces if you would like to keep using HDFS, S3, and/or other storage
solutions in tandem. Here is an article
<https://alluxio.com/blog/accelerating-data-analytics-on-ceph-object-storage-with-alluxio>
which describes a similar architecture.
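
From the Spark side the switch is mostly a URI change. A rough sketch in
spark-shell (the bucket, master address and paths are invented, and the
alluxio:// scheme assumes the Alluxio client jar is on Spark's
classpath):

// Going straight at S3: every read goes over the network to the store.
val direct = spark.read.parquet("s3a://my-bucket/warehouse/events/")

// Through Alluxio, which caches hot data on the compute nodes and can
// mount S3, HDFS, etc. into a single namespace.
val cached = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events/")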

Hope this helps,
Calvin


Re: Remove dependence on HDFS

Posted by Saisai Shao <sa...@gmail.com>.
IIUC, Spark isn't strongly bound to HDFS; it uses the common Hadoop
FileSystem layer, which supports different FS implementations, and HDFS
is just one option. You could also use S3 as the backend FS; from
Spark's point of view, the underlying FS implementation is transparent.
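
To illustrate, a hedged sketch in spark-shell (the bucket and keys are
placeholders; s3a:// needs the hadoop-aws jar and its AWS SDK
dependency on the classpath):

// Spark hands each URI to whatever Hadoop FileSystem implementation
// matches the scheme, so switching backends is config plus a URI change.
val hc = spark.sparkContext.hadoopConfiguration
hc.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hc.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

val logs = spark.read.json("s3a://some-bucket/logs/2017/02/12/")
logs.write.parquet("s3a://some-bucket/warehouse/logs/")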




Re: Remove dependence on HDFS

Posted by ayan guha <gu...@gmail.com>.
How about adding more NFS storage?
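
If the share is mounted at the same path on every worker, Spark can read
it with a plain file:// URI. A small sketch in spark-shell (the mount
point and file are made up):

// Only valid when /mnt/nfs/data is the same mount on all nodes;
// otherwise executors would read their own local filesystems.
val df = spark.read.csv("file:///mnt/nfs/data/events.csv")
df.count()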

--
Best Regards,
Ayan Guha

Re: Remove dependence on HDFS

Posted by Sean Owen <so...@cloudera.com>.
Data has to live somewhere -- how do you not add storage but store more
data?  Alluxio is not persistent storage, and S3 isn't on your premises.
