Posted to user@spark.apache.org by Brandon White <bw...@gmail.com> on 2015/07/09 08:35:50 UTC

S3 vs HDFS

Are there any significant performance differences between reading text
files from S3 and hdfs?

Re: S3 vs HDFS

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
S3 will obviously add network lag, whereas with HDFS, if your Spark
executors are running on the same data nodes, you get the advantage of data
locality.

Thanks
Best Regards

On Thu, Jul 9, 2015 at 12:05 PM, Brandon White <bw...@gmail.com>
wrote:

> Are there any significant performance differences between reading text
> files from S3 and hdfs?
>

Re: S3 vs HDFS

Posted by Steve Loughran <st...@hortonworks.com>.
On 11 Jul 2015, at 19:20, Aaron Davidson <il...@gmail.com> wrote:

Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use a fixed part size that aligns with Spark's default HDFS block size (64 MB, I think) to ensure the reads are aligned.


aah, I wasn't going to introduce that complication.

As Aaron says, if you do multipart uploads to S3, each part does end up in its own block.

What the S3 REST APIs don't give us is a way to determine the partition count, and hence the block size. Instead, the block size reported to Spark is simply the value of a constant set in the configuration.
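
To see that from spark-shell, something roughly like this works (the bucket and key are placeholders, and it assumes the s3n connector with credentials already configured):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // The "block size" of an S3 object is whatever the configuration says it is,
    // not a property of the object itself.
    val fs = FileSystem.get(new URI("s3n://my-bucket/"), sc.hadoopConfiguration)
    val status = fs.getFileStatus(new Path("s3n://my-bucket/data/part-00000"))
    println(s"reported block size: ${status.getBlockSize} bytes")
    println(s"fs.s3n.block.size:   ${sc.hadoopConfiguration.get("fs.s3n.block.size", "(default)")}")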


If you are trying to go multipart for performance:

1. You need a consistent block size across all your datasets.
2. In your configuration, set fs.s3n.multipart.uploads.block.size == fs.s3n.block.size (see the sketch below).
http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
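
As a rough sketch of item 2 from spark-shell (the 64 MB value is purely illustrative, chosen to match the default block size mentioned in the thread; the same values can go into core-site.xml instead):

    // Keep the multipart part size and the reported read block size identical,
    // so each uploaded part lines up with one read split.
    val blockSize = (64L * 1024 * 1024).toString  // 64 MB
    sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.enabled", "true")
    sc.hadoopConfiguration.set("fs.s3n.multipart.uploads.block.size", blockSize)
    sc.hadoopConfiguration.set("fs.s3n.block.size", blockSize)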



Re: S3 vs HDFS

Posted by Aaron Davidson <il...@gmail.com>.
Note that if you use multi-part upload, each part becomes 1 block, which
allows for multiple concurrent readers. One would typically use a fixed part
size that aligns with Spark's default HDFS block size (64 MB, I think) to
ensure the reads are aligned.
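
For instance (bucket and object name are placeholders), after uploading a large object in 64 MB parts and setting fs.s3n.block.size to the same 64 MB, you can check how many splits Spark actually derives:

    val rdd = sc.textFile("s3n://my-bucket/data/big-file.txt")
    // For a ~1 GB object this should come out around 16, i.e. one split per
    // 64 MB block, each of which can be read concurrently by a separate task.
    println(s"partitions: ${rdd.partitions.length}")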

On Sat, Jul 11, 2015 at 11:14 AM, Steve Loughran <st...@hortonworks.com>
wrote:

>  seek() is very, very expensive on s3, even short forward seeks. If your
> code does a lot of it, it will kill performance. Forward seeks are better in
> s3a (which as of Hadoop 2.7 is now safe to use) and in the s3 client that
> Amazon includes in EMR, but it's still sluggish.
>
>  The other killers are
>  -anything involving renaming files or directories
>  -copy operations
>  -listing lots of files.
>
>  Finally, S3 is HDD backed, 1 file == 1 block. In HDFS, while you can
> have >3 processes reading different replicas of the same block of a file
> (giving 3x the bandwidth), disk bandwidth from an s3 object will be shared
> by all readers. The more readers, the worse the performance.
>
>
>  On 9 Jul 2015, at 14:31, Daniel Darabos <da...@lynxanalytics.com>
> wrote:
>
>  I recommend testing it for yourself. Even if you have no application,
> you can just run the spark-ec2 script, log in, run spark-shell and try
> reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is
> the ephemeral HDFS cluster, which uses SSD.)
>
>  I just tested our application this way yesterday and found the SSD-based
> HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
> locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
> the HDFS client library and protocol are just better than the S3 versions
> (which are HTTP-based and use some 6-year-old libraries).
>
> On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
>
>> latency is much bigger for S3 (if that matters)
>> And with HDFS you'd get data-locality that will boost your app
>> performance.
>>
>>  I did some light experimenting on this;
>> see my presentation here for some benchmark numbers, etc.:
>> http://www.slideshare.net/sujee/hadoop-to-sparkv2
>>  (from slide 34 onwards)
>>
>>  cheers
>>  Sujee Maniyam (http://sujee.net |
>> http://www.linkedin.com/in/sujeemaniyam )
>>  teaching Spark
>> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>>
>>
>> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bw...@gmail.com>
>> wrote:
>>
>>> Are there any significant performance differences between reading text
>>> files from S3 and hdfs?
>>>
>>
>>
>
>

Re: S3 vs HDFS

Posted by Steve Loughran <st...@hortonworks.com>.
seek() is very, very expensive on s3, even short forward seeks. If your code does a lot of it, it will kill performance. Forward seeks are better in s3a (which as of Hadoop 2.7 is now safe to use) and in the s3 client that Amazon includes in EMR, but it's still sluggish.

The other killers are
 -anything involving renaming files or directories
 -copy operations
 -listing lots of files.

Finally, S3 is HDD backed, 1 file == 1 block. In HDFS, while you can have >3 processes reading different replicas of the same block of a file (giving 3x the bandwidth), disk bandwidth from an s3 object will be shared by all readers. The more readers, the worse the performance.
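
To make the rename point concrete, here is a rough spark-shell sketch (bucket and paths are placeholders): on s3n a rename is really a server-side copy followed by a delete, so even "moving" a file costs time proportional to its size.

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new URI("s3n://my-bucket/"), sc.hadoopConfiguration)
    val t0 = System.nanoTime()
    // On HDFS this would be a cheap metadata operation; here the bytes are
    // copied and the source is then deleted.
    fs.rename(new Path("/tmp-output/part-00000"), new Path("/final-output/part-00000"))
    println(f"rename took ${(System.nanoTime() - t0) / 1e9}%.1f s")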


On 9 Jul 2015, at 14:31, Daniel Darabos <da...@lynxanalytics.com> wrote:

I recommend testing it for yourself. Even if you have no application, you can just run the spark-ec2 script, log in, run spark-shell and try reading files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the ephemeral HDFS cluster, which uses SSD.)

I just tested our application this way yesterday and found the SSD-based HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or the HDFS client library and protocol are just better than the S3 versions (which are HTTP-based and use some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:
latency is much bigger for S3 (if that matters)
And with HDFS you'd get data-locality that will boost your app performance.

I did some light experimenting on this;
see my presentation here for some benchmark numbers, etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
(from slide 34 onwards)

cheers
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )
teaching Spark <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>

On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bw...@gmail.com> wrote:
Are there any significant performance differences between reading text files from S3 and hdfs?




Re: S3 vs HDFS

Posted by Daniel Darabos <da...@lynxanalytics.com>.
I recommend testing it for yourself. Even if you have no application, you
can just run the spark-ec2 script, log in, run spark-shell and try reading
files from an S3 bucket and from hdfs://<master IP>:9000/. (This is the
ephemeral HDFS cluster, which uses SSD.)
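
Something along these lines in spark-shell does the comparison (bucket, file name and the master IP are placeholders, and the S3 credentials need to be configured already):

    // Crude timing helper; count() forces a full read of the file.
    def time[T](label: String)(body: => T): T = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - t0) / 1e9}%.1f s")
      result
    }

    time("s3 read")   { sc.textFile("s3n://my-bucket/data/events.txt").count() }
    time("hdfs read") { sc.textFile("hdfs://<master IP>:9000/data/events.txt").count() }

Running each a couple of times helps smooth out connection setup and caching effects.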

I just tested our application this way yesterday and found the SSD-based
HDFS to outperform S3 by a factor of 2. I don't know the cause. It may be
locality like Akhil suggests, or SSD vs HDD (assuming S3 is HDD-backed). Or
the HDFS client library and protocol are just better than the S3 versions
(which are HTTP-based and use some 6-year-old libraries).

On Thu, Jul 9, 2015 at 9:54 AM, Sujee Maniyam <su...@sujee.net> wrote:

> latency is much bigger for S3 (if that matters)
> And with HDFS you'd get data-locality that will boost your app performance.
>
> I did some light experimenting on this;
> see my presentation here for some benchmark numbers, etc.:
> http://www.slideshare.net/sujee/hadoop-to-sparkv2
> (from slide 34 onwards)
>
> cheers
> Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam
> )
> teaching Spark
> <http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>
>
>
> On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bw...@gmail.com>
> wrote:
>
>> Are there any significant performance differences between reading text
>> files from S3 and hdfs?
>>
>
>

Re: S3 vs HDFS

Posted by Sujee Maniyam <su...@sujee.net>.
latency is much bigger for S3 (if that matters)
And with HDFS you'd get data-locality that will boost your app performance.

I did some light experimenting on this;
see my presentation here for some benchmark numbers, etc.:
http://www.slideshare.net/sujee/hadoop-to-sparkv2
(from slide 34 onwards)

cheers
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )
teaching Spark
<http://elephantscale.com/training/spark-for-developers/?utm_source=mailinglist&utm_medium=email&utm_campaign=signature>


On Wed, Jul 8, 2015 at 11:35 PM, Brandon White <bw...@gmail.com>
wrote:

> Are there any significant performance differences between reading text
> files from S3 and hdfs?
>