Posted to user@spark.apache.org by Ashish Mukherjee <as...@gmail.com> on 2015/03/05 14:58:22 UTC

Spark with data on NFS v HDFS

Hello,

I understand Spark can be used with Hadoop or standalone. I have certain
questions related to use of the correct FS for Spark data.

What is the efficiency trade-off in feeding data to Spark from NFS v HDFS?

If one is not using Hadoop, is it still usual to house data in HDFS for
Spark to read from because of better reliability compared to NFS?

Should data be stored on the local FS (not NFS) only for Spark jobs which run
on a single machine?

Regards,
Ashish

Re: Spark with data on NFS v HDFS

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Thu, Mar 5, 2015 at 10:58 PM, Ashish Mukherjee <
ashish.mukherjee@gmail.com> wrote:
>
> I understand Spark can be used with Hadoop or standalone. I have certain
> questions related to use of the correct FS for Spark data.
>
> What is the efficiency trade-off in feeding data to Spark from NFS v HDFS?
>

As I understand it, one performance advantage of HDFS is data locality:
the scheduler tries to run each task on a cluster node that already holds
the relevant data blocks on its local disk, so the computation goes to
where the data is. With NFS, every task must first pull its data over the
network from the file server(s), so there is no such thing as "data
locality".
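In practice the choice mostly shows up in the URI scheme you hand to Spark's input methods (hdfs:// for HDFS, file:// for a local or NFS-mounted path). As a small sketch, here is a hypothetical helper that builds such URIs; the function name and the namenode address are made up for illustration, but the two schemes themselves are the ones Spark understands:

```python
def spark_input_uri(path, scheme="hdfs", authority="namenode:8020"):
    """Build a fully qualified input URI for e.g. SparkContext.textFile().

    scheme="hdfs" -> hdfs://<authority><path>  (reads can be data-local)
    scheme="file" -> file://<path>             (local disk or NFS mount;
                                                no locality, every task
                                                reads over the mount)
    """
    if not path.startswith("/"):
        raise ValueError("path must be absolute")
    if scheme == "hdfs":
        return f"hdfs://{authority}{path}"
    if scheme == "file":
        return f"file://{path}"
    raise ValueError(f"unsupported scheme: {scheme}")

# Hypothetical usage (assumes a SparkContext `sc` exists):
#   rdd = sc.textFile(spark_input_uri("/data/logs"))              # HDFS
#   rdd = sc.textFile(spark_input_uri("/mnt/nfs/logs", "file"))   # NFS mount
```

Note that with file:// the same path must be visible at that mount point on every worker node, which is why NFS (rather than plain local disk) is what people reach for on multi-node clusters without HDFS.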

Tobias