Posted to user@spark.apache.org by "Afshin, Bardia" <Ba...@capitalone.com> on 2017/04/24 16:48:43 UTC

removing columns from file

Hi there,

I have a process that downloads thousands of files from an S3 bucket, removes a set of columns from each file, and uploads the result back to S3.

S3 is currently not the bottleneck; having a Single Master Node Spark instance is the bottleneck. One approach is to distribute the files across multiple Spark Master Node workers, which would make it faster.

Question:

1. Is there a way to utilize master / slave nodes on Spark to distribute this downloading and processing of files – so it can, say, do 10 files at a time?

2. Is there a way to scale workers with Spark downloading and processing files, even if they are all Single Master Node?

Thanks,
Bardia

Re: removing columns from file

Posted by Steve Loughran <st...@hortonworks.com>.
On 28 Apr 2017, at 16:10, Anubhav Agarwal <an...@gmail.com> wrote:

Are you using Spark's textFiles method? If so, go through this blog :-
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219


That's an old/dated blog post.

If you get the Hadoop 2.8 binaries on your classpath and give s3a a simple path like "s3a://bucket/events", it does a full directory tree listing. The example in that post used a complex wildcard, which hasn't yet been sped up, as it's hard to do in a way that works effectively everywhere.

Having all your data in 1 dir works nicely.
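A small sketch of the difference (the bucket layout and an existing SparkSession named spark are assumptions for illustration):

// Simple path: on Hadoop 2.8+, s3a can enumerate the whole tree in one deep listing.
val allEvents = spark.read.format("csv").load("s3a://bucket/events")

// Complex wildcard: falls back to the slower glob code path the blog post ran into.
val someEvents = spark.read.format("csv").load("s3a://bucket/events/*/*/part-*")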


Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <Ba...@capitalone.com> wrote:
Hi there,

I have a process that downloads thousands of files from an S3 bucket, removes a set of columns from each file, and uploads the result back to S3.

S3 is currently not the bottleneck; having a Single Master Node Spark instance is the bottleneck. One approach is to distribute the files across multiple Spark Master Node workers, which would make it faster.

yes, with > 1 worker, and if the work can be partitioned


Question:

1. Is there a way to utilize master / slave nodes on Spark to distribute this downloading and processing of files – so it can, say, do 10 files at a time?


yes, they are called RDDs/DataFrames & Datasets


If you are doing all the processing on the Spark driver, then you aren't really using Spark much; you're just processing the files in Scala.

To get a DataFrame (assuming a SparkSession instance named spark):

val df = spark.read.format("csv").load("s3a://bucket/data")

You now have a DataFrame over all the files in the directory /data in the bucket, partitioned however Spark decides (which depends on the number of workers, the compression format used and its splittability). Assuming you can configure the DataFrame with the column structure, you can filter aggressively by selecting only those columns you want:

// keep only the columns you need
val filteredDf = df.select("rental", "start_time")
// write to HDFS first; copy up to S3 afterwards
filteredDf.write.save("hdfs://final/processed")
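Putting the steps together, a minimal end-to-end sketch (the bucket, the column names and the header option are illustrative assumptions, not details from the thread):

import org.apache.spark.sql.SparkSession

object PruneColumns {
  def main(args: Array[String]): Unit = {
    // Driver-side setup; the read, select and write below are executed by the workers.
    val spark = SparkSession.builder().appName("prune-columns").getOrCreate()

    // One DataFrame over every file under the prefix; "header" assumes the CSVs carry column names.
    val df = spark.read.option("header", "true").csv("s3a://bucket/data")

    // Drop everything except the columns you need.
    val filtered = df.select("rental", "start_time")

    // Write to HDFS first; push the result up to S3 afterwards (e.g. via distcp, below).
    filtered.write.option("header", "true").csv("hdfs://final/processed")

    spark.stop()
  }
}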

then, once you've got all the data done, copy it up to S3 via distcp
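For example (the destination bucket and path are placeholders, and this assumes the Hadoop tools are on your path):

hadoop distcp hdfs://final/processed s3a://bucket/final/processed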

I'd recommend you start doing this with a small number of files locally, getting the code working, then see if you can use it with s3 as the source/dest of data, again, locally if you want (it's just slow), then move to in-EC2 for the bandwidth.

Bandwidth-wise, there are some pretty major performance issues with the s3n connector. S3a in Hadoop 2.7+ works, with Hadoop 2.8 having a lot more speedup, especially when using ORC and Parquet as a source, where there's a special "random access mode".
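A minimal sketch of switching that mode on from Spark (fs.s3a.experimental.input.fadvise is the Hadoop 2.8+ property behind it; setting it session-wide at build time is an assumption about your job):

import org.apache.spark.sql.SparkSession

// "random" optimises seek-heavy reads, which suits columnar formats like ORC and Parquet.
val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()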

further reading:
https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-spark/index.html

https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.14.1/bk_hdcloud-aws/content/s3-performance/index.html


2. Is there a way to scale workers with Spark downloading and processing files, even if they are all Single Master Node?



I think there may be some terminology confusion here. You are going to have to have one process which is the Spark driver: either on your client machine, deployed somewhere in the cluster via YARN/Mesos, or running in a static location within a Spark standalone cluster. Everything other than the driver process is a worker, which will do the work.




Re: removing columns from file

Posted by Anubhav Agarwal <an...@gmail.com>.
Are you using Spark's textFiles method? If so, go through this blog :-
http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219

Anubhav

On Mon, Apr 24, 2017 at 12:48 PM, Afshin, Bardia <Bardia.Afshin@capitalone.com> wrote:

> Hi there,
>
>
>
> I have a process that downloads thousands of files from an S3 bucket,
> removes a set of columns from each file, and uploads the result back to S3.
>
>
>
> S3 is currently not the bottleneck; having a Single Master Node Spark
> instance is the bottleneck. One approach is to distribute the files across
> multiple Spark Master Node workers, which would make it faster.
>
>
>
> Question:
>
> 1. Is there a way to utilize master / slave nodes on Spark to distribute
> this downloading and processing of files – so it can, say, do 10 files at
> a time?
>
> 2. Is there a way to scale workers with Spark downloading and processing
> files, even if they are all Single Master Node?
>
>
>
> Thanks,
>
> Bardia