Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/04/11 10:07:08 UTC

optimising storage and ec2 instances

Hi all,

I'm a beginner with Spark, and I'm wondering if someone could provide
guidance on the following two questions.

Background: I have a data set growing by 6 TB p.a. I plan to use Spark to
read in all the data, manipulate it and build a predictive model on it (say
a GBM). I plan to store the data in S3 and use EMR to launch Spark, reading
in the data from S3.

1. Which option is best for storing the data on S3 for the purpose of
analysing it in EMR Spark?
Option A: storing the 6 TB of data as 173 million individual text files
Option B: zipping up the above 173 million text files into 240,000 zip files
Option C: appending the individual text files together, so I have 240,000
text files p.a.
Option D: combining the text files even further

2. Any recommendations on the EMR setup to analyse the 6 TB of data all at
once and build a GBM, in terms of
1) The type of EC2 instances I need?
2) The number of such instances I need?
3) Rough estimate of cost?


Thanks so much,
Zeming

Re: optimising storage and ec2 instances

Posted by Sam Elamin <hu...@gmail.com>.
Hi Zeming Yu, Steve

Just to add: we are also going down the partitioning route, but you should
know that if you are in AWS land you are most likely going to be using EMR
at any given time.

At the moment EMR does not do recursive search on wildcards; see
<http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-troubleshoot-errors-io.html#recurseinput>

However, Spark itself seems to deal with wildcards fine, so if you don't
have a data-serving layer on top for your customers then you should be fine.
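
For what it's worth, here is a minimal PySpark sketch of a wildcard read
over a nested S3 layout (the bucket name and layout below are made up for
illustration, not taken from this thread):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wildcard-read").getOrCreate()

    # Hypothetical layout; s3a:// is the Hadoop S3 client scheme, while EMR
    # also ships its own EMRFS s3:// connector.
    df = spark.read.text("s3a://my-bucket/datasets/year=2017/month=*/day=*/*.gz")
    df.show(5)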

Regards
sam

On Tue, Apr 11, 2017 at 1:21 PM, Zeming Yu <ze...@gmail.com> wrote:

> everything works best if your sources are a few tens to hundreds of MB or
> more
>
> Are you referring to the size of the zip file or to the individual
> unzipped files?
>
> Any issues with storing a 60 MB zip file with heaps of text files inside?
>
> On 11 Apr. 2017 9:09 pm, "Steve Loughran" <st...@hortonworks.com> wrote:
>
>>
>> > On 11 Apr 2017, at 11:07, Zeming Yu <ze...@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I'm a beginner with Spark, and I'm wondering if someone could provide
>> > guidance on the following two questions.
>> >
>> > Background: I have a data set growing by 6 TB p.a. I plan to use Spark
>> > to read in all the data, manipulate it and build a predictive model on
>> > it (say a GBM). I plan to store the data in S3 and use EMR to launch
>> > Spark, reading in the data from S3.
>> >
>> > 1. Which option is best for storing the data on S3 for the purpose of
>> > analysing it in EMR Spark?
>> > Option A: storing the 6 TB of data as 173 million individual text files
>> > Option B: zipping up the above 173 million text files into 240,000 zip
>> > files
>> > Option C: appending the individual text files together, so I have
>> > 240,000 text files p.a.
>> > Option D: combining the text files even further
>> >
>>
>> Everything works best if your sources are a few tens to hundreds of MB
>> or more; that way work can be partitioned up by file. If you use more
>> structured formats (Avro compressed with Snappy, etc.), you can throw
>> more than one executor at work inside a file. Structure is handy all
>> round, even if it's just adding timestamp and provenance columns to each
>> data file.
>>
>> There's the HAR file format from Hadoop, which can merge lots of small
>> files into larger ones, allowing work to be scheduled per HAR file. It's
>> recommended for HDFS, as HDFS hates small files; on S3 the penalties for
>> small files (including throttling of HTTP requests to shards of a
>> bucket) still apply, but they are less significant.
>>
>> One thing to be aware of is that the S3 clients Spark uses are very
>> inefficient at listing wide directory trees, and Spark is not always the
>> best at partitioning work because of this. You can accidentally create a
>> very inefficient tree structure like
>> datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour.
>> Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is
>> better, Spark hasn't yet fully adapted to those changes (use of specific
>> API calls). There's also a lot more to be done in S3A to handle
>> wildcards in the directory tree much more efficiently (HADOOP-13204); it
>> needs to address patterns like datasets/year=201?/month=*/day=10 without
>> treewalking and without fetching too much data from wildcards near the
>> top of the tree. We need to avoid implementing something which works
>> well on *my* layouts but absolutely dies on other people's. As is usual
>> in OSS, help is welcome; early testing here is as critical as coding, so
>> as to ensure things will work with your file structures.
>>
>> -Steve
>>
>>
>> > 2. Any recommendations on the EMR setup to analyse the 6 TB of data
>> > all at once and build a GBM, in terms of
>> > 1) The type of EC2 instances I need?
>> > 2) The number of such instances I need?
>> > 3) Rough estimate of cost?
>> >
>>
>> no opinion there
>>
>> >
>> > Thanks so much,
>> > Zeming
>> >
>>
>>

Re: optimising storage and ec2 instances

Posted by Zeming Yu <ze...@gmail.com>.
everything works best if your sources are a few tens to hundreds of MB or
more

Are you referring to the size of the zip file or to the individual unzipped
files?

Any issues with storing a 60 MB zip file with heaps of text files inside?

On 11 Apr. 2017 9:09 pm, "Steve Loughran" <st...@hortonworks.com> wrote:

>
> > On 11 Apr 2017, at 11:07, Zeming Yu <ze...@gmail.com> wrote:
> >
> > Hi all,
> >
> > I'm a beginner with Spark, and I'm wondering if someone could provide
> > guidance on the following two questions.
> >
> > Background: I have a data set growing by 6 TB p.a. I plan to use Spark
> > to read in all the data, manipulate it and build a predictive model on
> > it (say a GBM). I plan to store the data in S3 and use EMR to launch
> > Spark, reading in the data from S3.
> >
> > 1. Which option is best for storing the data on S3 for the purpose of
> > analysing it in EMR Spark?
> > Option A: storing the 6 TB of data as 173 million individual text files
> > Option B: zipping up the above 173 million text files into 240,000 zip
> > files
> > Option C: appending the individual text files together, so I have
> > 240,000 text files p.a.
> > Option D: combining the text files even further
> >
>
> Everything works best if your sources are a few tens to hundreds of MB
> or more; that way work can be partitioned up by file. If you use more
> structured formats (Avro compressed with Snappy, etc.), you can throw
> more than one executor at work inside a file. Structure is handy all
> round, even if it's just adding timestamp and provenance columns to each
> data file.
>
> There's the HAR file format from Hadoop, which can merge lots of small
> files into larger ones, allowing work to be scheduled per HAR file. It's
> recommended for HDFS, as HDFS hates small files; on S3 the penalties for
> small files (including throttling of HTTP requests to shards of a
> bucket) still apply, but they are less significant.
>
> One thing to be aware of is that the S3 clients Spark uses are very
> inefficient at listing wide directory trees, and Spark is not always the
> best at partitioning work because of this. You can accidentally create a
> very inefficient tree structure like
> datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour.
> Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is
> better, Spark hasn't yet fully adapted to those changes (use of specific
> API calls). There's also a lot more to be done in S3A to handle
> wildcards in the directory tree much more efficiently (HADOOP-13204); it
> needs to address patterns like datasets/year=201?/month=*/day=10 without
> treewalking and without fetching too much data from wildcards near the
> top of the tree. We need to avoid implementing something which works
> well on *my* layouts but absolutely dies on other people's. As is usual
> in OSS, help is welcome; early testing here is as critical as coding, so
> as to ensure things will work with your file structures.
>
> -Steve
>
>
> > 2. Any recommendations on the EMR setup to analyse the 6 TB of data
> > all at once and build a GBM, in terms of
> > 1) The type of EC2 instances I need?
> > 2) The number of such instances I need?
> > 3) Rough estimate of cost?
> >
>
> no opinion there
>
> >
> > Thanks so much,
> > Zeming
> >
>
>

Re: optimising storage and ec2 instances

Posted by Steve Loughran <st...@hortonworks.com>.
> On 11 Apr 2017, at 11:07, Zeming Yu <ze...@gmail.com> wrote:
> 
> Hi all,
> 
> I'm a beginner with Spark, and I'm wondering if someone could provide guidance on the following two questions.
> 
> Background: I have a data set growing by 6 TB p.a. I plan to use Spark to read in all the data, manipulate it and build a predictive model on it (say a GBM). I plan to store the data in S3 and use EMR to launch Spark, reading in the data from S3.
> 
> 1. Which option is best for storing the data on S3 for the purpose of analysing it in EMR Spark?
> Option A: storing the 6 TB of data as 173 million individual text files
> Option B: zipping up the above 173 million text files into 240,000 zip files
> Option C: appending the individual text files together, so I have 240,000 text files p.a.
> Option D: combining the text files even further
> 

Everything works best if your sources are a few tens to hundreds of MB or more; that way work can be partitioned up by file. If you use more structured formats (Avro compressed with Snappy, etc.), you can throw more than one executor at work inside a file. Structure is handy all round, even if it's just adding timestamp and provenance columns to each data file.
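
To make the structured-format point concrete, here is a minimal PySpark
sketch (not from the thread; the bucket paths and column names are made up)
that converts raw text into a splittable columnar format while adding the
timestamp and provenance columns mentioned above. It uses Parquet + Snappy,
which gives the same benefit as Avro + Snappy:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("structure-the-data").getOrCreate()

    # Hypothetical input path holding the raw text records.
    raw = spark.read.text("s3a://my-bucket/raw/2017/04/")

    structured = (raw
        .withColumn("ingest_ts", F.current_timestamp())     # timestamp column
        .withColumn("source_file", F.input_file_name()))    # provenance column

    # Parquet compressed with Snappy is splittable, so more than one executor
    # can work inside a single large file.
    (structured
        .write
        .option("compression", "snappy")
        .parquet("s3a://my-bucket/structured/2017/04/"))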

There's the HAR file format from Hadoop, which can merge lots of small files into larger ones, allowing work to be scheduled per HAR file. It's recommended for HDFS, as HDFS hates small files; on S3 the penalties for small files (including throttling of HTTP requests to shards of a bucket) still apply, but they are less significant.
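
If HAR doesn't fit, a rough sketch of doing the compaction with Spark
itself, reading the many small text files and rewriting them as a smaller
number of larger ones (paths and the target file count are illustrative
only):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    # Hypothetical prefix containing many tiny text files.
    small = spark.read.text("s3a://my-bucket/raw/2017/04/")

    # coalesce() lowers the output file count without a full shuffle; pick a
    # count that lands each file in the tens-to-hundreds-of-MB range above.
    # Note gzip output is not splittable, so keep files modest in size, or
    # write Parquet instead.
    (small
        .coalesce(200)
        .write
        .option("compression", "gzip")
        .text("s3a://my-bucket/compacted/2017/04/"))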

One thing to be aware of is that the S3 clients Spark uses are very inefficient at listing wide directory trees, and Spark is not always the best at partitioning work because of this. You can accidentally create a very inefficient tree structure like datasets/year=2017/month=5/day=10/hour=12/, with only one file per hour. Listing and partitioning suffer here, and while s3a on Hadoop 2.8 is better, Spark hasn't yet fully adapted to those changes (use of specific API calls). There's also a lot more to be done in S3A to handle wildcards in the directory tree much more efficiently (HADOOP-13204); it needs to address patterns like datasets/year=201?/month=*/day=10 without treewalking and without fetching too much data from wildcards near the top of the tree. We need to avoid implementing something which works well on *my* layouts but absolutely dies on other people's. As is usual in OSS, help is welcome; early testing here is as critical as coding, so as to ensure things will work with your file structures.
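
As an illustration of the layout point, a sketch (hypothetical bucket and
columns, not the thread's data) of writing coarser partitions so that reads
can filter on column values instead of walking a deep hourly tree:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("coarse-partitions").getOrCreate()

    # Assume a structured dataset that already carries year/month/day columns.
    structured = spark.read.parquet("s3a://my-bucket/structured/")

    # Partition directories by year and month only, rather than
    # year/month/day/hour with one small file per hour.
    (structured
        .write
        .partitionBy("year", "month")
        .parquet("s3a://my-bucket/events/"))

    # The datasets/year=201?/month=*/day=10 intent becomes filters: the year
    # predicate prunes whole partition directories, and the day predicate is
    # pushed down into the Parquet files, with no deep glob walk.
    tenth = (spark.read.parquet("s3a://my-bucket/events/")
             .where(F.col("year").between(2010, 2019) & (F.col("day") == 10)))
    tenth.show(5)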

-Steve


> 2. Any recommendations on the EMR setup to analyse the 6 TB of data all at once and build a GBM, in terms of
> 1) The type of EC2 instances I need?
> 2) The number of such instances I need?
> 3) Rough estimate of cost?
> 

no opinion there

> 
> Thanks so much,
> Zeming
> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org