Posted to user@spark.apache.org by Andy Davidson <An...@SantaCruzIntegration.com> on 2017/11/22 18:21:02 UTC

newbie: how to partition data on file system. What are best practices?

I am working on a deep learning project. Currently we do everything on a
single machine. I am trying to figure out how we might be able to move to a
clustered Spark environment.

Clearly it's possible that a machine or job on the cluster might fail, so I
assume the data needs to be replicated to some degree.

Eventually I expect I will need to process multi-petabyte files and will
need to come up with some sort of sharding scheme. Communication costs could
be a problem. Does Spark have any knowledge of how the data is distributed
and replicated across the machines in my cluster?
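
To make my question concrete, here is roughly how I understand partitioning
to be exposed in Spark. This is just a sketch of what I have in mind (the
input path and partition count below are made up), so please correct me if I
have the wrong mental model:

    # PySpark sketch: inspecting and changing how data is partitioned
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-check").getOrCreate()

    # Spark splits the input into partitions when it reads the file
    df = spark.read.parquet("/data/training_set.parquet")  # hypothetical path
    print(df.rdd.getNumPartitions())

    # repartition() shuffles data across the cluster; this is where the
    # communication cost I am worried about would show up
    df = df.repartition(200)
    print(df.rdd.getNumPartitions())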

Let's say my data source is S3. Should I copy the data to my EC2 cluster or
try to read directly from S3?
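
For example, I think reading directly from S3 would look something like the
sketch below, assuming the S3A connector and credentials are already
configured (the bucket name is made up):

    # PySpark sketch: reading input directly from S3 instead of copying it
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-read").getOrCreate()

    # "s3a://" paths go through the Hadoop S3A connector, so Spark reads the
    # objects straight from S3 without a local copy first
    df = spark.read.csv("s3a://my-training-bucket/labels.csv", header=True)  # hypothetical bucket
    df.show(5)

Is that the recommended approach, or is it better to stage the data closer
to the cluster first?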

If our pilot is successful we expect to need to process multi-petabyte files.

What are best practices?

Kind regards

Andy

P.S. We expect to use AWS or some other cloud solution.