Posted to user@spark.apache.org by Manoj Samel <ma...@gmail.com> on 2014/01/22 21:37:22 UTC

How to use cluster for large set of linux files

I have a set of CSV files that I want to read as a single RDD using a
standalone cluster.

These files reside on one machine right now. If I start a cluster with
multiple worker nodes, how do I use these worker nodes to read the files
and do the RDD computation? Do I have to copy the files onto every worker
node?

Assume that copying these into HDFS is not an option for now.

Thanks,
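
[For reference, a minimal sketch of reading a whole directory of CSV files as a single RDD from the spark-shell. The /data/csv path is a placeholder, and on a cluster that path has to resolve on every worker, which is what the replies below address.]

// sc.textFile accepts globs and comma-separated lists of paths, so a whole
// directory of CSV files can be read as one RDD in a single call.
val lines = sc.textFile("/data/csv/*.csv")

// Split each line into its fields; assumes plain comma-delimited rows
// with no quoted commas.
val rows = lines.map(_.split(","))

// count() is an action: it runs on the cluster and returns a value to the driver.
println(rows.count())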

Re: How to use cluster for large set of linux files

Posted by Manoj Samel <ma...@gmail.com>.
Thanks Ognen. HDFS is the plan; I am just hitting an issue when building
for HDFS, hence local files for now.


On Wed, Jan 22, 2014 at 1:03 PM, Ognen Duzlevski <
ognen@plainvanillagames.com> wrote:

> Manoj,
>
> large is a relative term ;)
>
> NFS is a rather slow solution, at least that's always been my experience.
> However, it will work for smaller files.
>
> One way to do it is to put the files in S3 on Amazon. However, then your
> network becomes a limiting factor.
>
> Another way is to replicate all the files on each node, but that can get
> tedious and, depending on how much disk space you have, may not be an
> option.
>
> Finally, there are things like http://code.google.com/p/mogilefs/, but they
> seem to need a special library to read a file - Spark would probably need
> some kind of patching to make this work, since MogileFS may not expose the
> usual filesystem interface. However, it could be a viable solution; I am
> just starting to play with it.
>
> Ognen
>
>
> On Wed, Jan 22, 2014 at 8:37 PM, Manoj Samel <ma...@gmail.com> wrote:
>
>> I have a set of CSV files that I want to read as a single RDD using a
>> standalone cluster.
>>
>> These files reside on one machine right now. If I start a cluster with
>> multiple worker nodes, how do I use these worker nodes to read the files
>> and do the RDD computation? Do I have to copy the files onto every worker
>> node?
>>
>> Assume that copying these into HDFS is not an option for now.
>>
>> Thanks,
>>
>
>
>
> --
> "Le secret des grandes fortunes sans cause apparente est un crime oublié,
> parce qu'il a été proprement fait" - Honore de Balzac
>

Re: How to use cluster for large set of linux files

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Manoj,

large is a relative term ;)

NFS is a rather slow solution, at least that's always been my experience.
However, it will work for smaller files.

One way to do it is to put the files in S3 on Amazon. However, then your
network becomes a limiting factor.
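
[A rough sketch of the S3 route from the spark-shell, assuming the Hadoop s3n:// filesystem support is available on the classpath; the bucket name, key prefix, and credentials below are placeholders.]

// Configure S3 credentials for the s3n filesystem (placeholder values).
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Read every CSV file under a bucket prefix as one RDD; each worker pulls
// its own partitions from S3 over the network.
val lines = sc.textFile("s3n://your-bucket/csv/*.csv")
lines.take(5).foreach(println)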

Another way is to replicate all the files on each node, but that can get
tedious and, depending on how much disk space you have, may not be an
option.

Finally, there are things like http://code.google.com/p/mogilefs/, but they
seem to need a special library to read a file - Spark would probably need
some kind of patching to make this work, since MogileFS may not expose the
usual filesystem interface. However, it could be a viable solution; I am
just starting to play with it.

Ognen


On Wed, Jan 22, 2014 at 8:37 PM, Manoj Samel <ma...@gmail.com> wrote:

> I have a set of CSV files that I want to read as a single RDD using a
> standalone cluster.
>
> These files reside on one machine right now. If I start a cluster with
> multiple worker nodes, how do I use these worker nodes to read the files
> and do the RDD computation? Do I have to copy the files onto every worker
> node?
>
> Assume that copying these into HDFS is not an option for now.
>
> Thanks,
>



-- 
"Le secret des grandes fortunes sans cause apparente est un crime oublié,
parce qu'il a été proprement fait" - Honore de Balzac

Re: How to use cluster for large set of linux files

Posted by Matei Zaharia <ma...@gmail.com>.
When you do foreach(println) on a cluster, that calls println *on the worker nodes*, so the output goes to their stdout and stderr files instead of to your shell. To make sure it loaded the file, use operations that return the data to the driver, like .first() or .take().

Matei
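
[A small sketch of the difference; the path and RDD name are placeholders.]

val lines = sc.textFile("/data/csv/*.csv")

// Runs println on the executors, so nothing useful appears in the driver's shell:
lines.foreach(println)

// These return data (or a result) to the driver first, so they print locally:
lines.take(5).foreach(println)
println(lines.first())
println(lines.count())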

On Jan 22, 2014, at 12:56 PM, Manoj Samel <ma...@gmail.com> wrote:

> Thanks Matei.
> 
> One thing I noticed after doing this and starting MASTER=spark://xxxx spark-shell is that everything works, BUT xxx.foreach(println) prints blank lines. All other logic seems to be working. If I do xx.count etc., I can see the value; just the println does not seem to be working.
> 
> 
> On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia <ma...@gmail.com> wrote:
> Hi Manoj,
> 
> You’d have to make the files available at the same path on each machine through something like NFS. You don’t need to copy them, though that would also work.
> 
> Matei
> 
> On Jan 22, 2014, at 12:37 PM, Manoj Samel <ma...@gmail.com> wrote:
> 
> > I have a set of CSV files that I want to read as a single RDD using a standalone cluster.
> >
> > These files reside on one machine right now. If I start a cluster with multiple worker nodes, how do I use these worker nodes to read the files and do the RDD computation? Do I have to copy the files onto every worker node?
> >
> > Assume that copying these into HDFS is not an option for now.
> >
> > Thanks,
> 
> 


Re: How to use cluster for large set of linux files

Posted by Manoj Samel <ma...@gmail.com>.
Thanks Matei.

One thing I noticed after doing this and starting MASTER=spark://xxxx
spark-shell is that everything works, BUT xxx.foreach(println) prints blank
lines. All other logic seems to be working. If I do xx.count etc., I can see
the value; just the println does not seem to be working.


On Wed, Jan 22, 2014 at 12:39 PM, Matei Zaharia <ma...@gmail.com> wrote:

> Hi Manoj,
>
> You’d have to make the files available at the same path on each machine
> through something like NFS. You don’t need to copy them, though that would
> also work.
>
> Matei
>
> On Jan 22, 2014, at 12:37 PM, Manoj Samel <ma...@gmail.com>
> wrote:
>
> > I have a set of CSV files that I want to read as a single RDD using a
> > standalone cluster.
> >
> > These files reside on one machine right now. If I start a cluster with
> > multiple worker nodes, how do I use these worker nodes to read the files
> > and do the RDD computation? Do I have to copy the files onto every worker
> > node?
> >
> > Assume that copying these into HDFS is not an option for now.
> >
> > Thanks,
>
>

Re: How to use cluster for large set of linux files

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Manoj,

You’d have to make the files available at the same path on each machine through something like NFS. You don’t need to copy them, though that would also work.

Matei
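
[A minimal sketch of that setup, assuming the files are exported over NFS and mounted at the same placeholder path /mnt/shared/csv on the driver and on every worker.]

// The file:// URI is read locally on each worker, so the mount must exist
// at this exact path on all nodes.
val lines = sc.textFile("file:///mnt/shared/csv/*.csv")

// Bring a sample back to the driver to confirm the files were picked up.
lines.take(5).foreach(println)
println(lines.count())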

On Jan 22, 2014, at 12:37 PM, Manoj Samel <ma...@gmail.com> wrote:

> I have a set of CSV files that I want to read as a single RDD using a standalone cluster.
>
> These files reside on one machine right now. If I start a cluster with multiple worker nodes, how do I use these worker nodes to read the files and do the RDD computation? Do I have to copy the files onto every worker node?
>
> Assume that copying these into HDFS is not an option for now.
> 
> Thanks,