Posted to dev@spark.apache.org by "Lalwani, Jayesh" <Ja...@capitalone.com> on 2017/11/29 21:45:39 UTC

Leveraging S3 select

AWS announced at re:Invent that they are launching S3 Select. This can allow Spark to push down predicates to S3, rather than read the entire file in memory. Are there any plans to update Spark to use S3 Select?

Re: Leveraging S3 select

Posted by Steve Loughran <st...@hortonworks.com>.

On 8 Dec 2017, at 17:05, Andrew Duffy <ad...@palantir.com> wrote:

Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done a decent amount of digging, but all I've found is a reference in a slide deck

Is that one of mine?

We haven't done any benchmarking with/without random IO for a while, as we've taken that as a given and have been worrying about the other aspects of the problem: speeding up directory listings & getFileStatus calls (used a lot in the serialized partitioning phase), and making direct commits of work to S3 both correct and performant.

I'm trying to sort out some benchmarking there, which involves: cherry-picking the new changes into an internal release, building that, and having someone who understands benchmarking set up a cluster and run the tests, which costs their time and the cost of the clusters. I say clusters as it'll inevitably involve playing with different VM options and some EMR clusters alongside(*).

One bit of fun there is that different instances of the same cluster spec may give different numbers; it depends on the actual CPUs allocated, network, and neighbours. When we do publish some numbers, we do it from a single cluster instance, rather than taking the "best per-test outcome on multiple clusters". It's good to check whether others do the same.

Otherwise: test with your own code & the Hadoop 2.8.1+ JARs and see what numbers you get. If you are using Parquet or ORC, I would not consider using the sequential IO code. Conversely, if you are working with CSV, Avro or gzip, you don't want random IO, because what would be a single file GET with some forward skip-and-discard reads becomes a slow sequence of GETs with latency between each one.
HADOOP-14965<https://issues.apache.org/jira/browse/HADOOP-14965> (not yet committed) changes the default policy of an input stream to "switch to random IO mode on the first backwards seek", so you don't need to decide upfront what to use. There's the potential cost of the first HTTPS abort on that initial backwards seek, but after that it's random IO all the way. The WASB client has been doing this for a while and everyone is happy, not least because it's one less tuning option to document & test, and it eliminates a whole class of support calls of the "client is fast to read .csv but not .orc files" kind.
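For anyone who wants to run that experiment themselves, here is a rough, hypothetical PySpark sketch that times the same Parquet scan under the "sequential" and "random" S3A input policies. The fs.s3a.experimental.fadvise property is the real Hadoop 2.8+ option; the bucket, path and filter column are placeholders you'd swap for your own data.

import time
from pyspark.sql import SparkSession

def timed_scan(policy, path):
    """Run one Parquet scan with the given S3A input policy and time it."""
    spark = (SparkSession.builder
             .appName("s3a-fadvise-" + policy)
             # spark.hadoop.* settings are copied into the Hadoop Configuration,
             # so this sets fs.s3a.experimental.fadvise for the S3A input streams
             .config("spark.hadoop.fs.s3a.experimental.fadvise", policy)
             .getOrCreate())
    start = time.time()
    # hypothetical TPC-DS-style dataset and predicate; swap in your own
    count = spark.read.parquet(path).where("ss_quantity > 10").count()
    elapsed = time.time() - start
    spark.stop()
    return count, elapsed

if __name__ == "__main__":
    for policy in ("sequential", "random"):
        n, secs = timed_scan(policy, "s3a://my-bucket/tpcds/store_sales/")
        print("%-10s rows=%d time=%.1fs" % (policy, n, secs))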

-Steve


(*) I have a hadoop branch-2.9 fork with the new committer stuff in it if someone wants to compare numbers there. Bear in mind that the current RDD.write(s3a://something) command, when it uses the Hadoop FileOutputFormats and hence the FileOutputCommitter, is not just observably a slow O(data) kind of operation, it is *not correct*, so the performance is just a detail. It's the one you notice, but not the issue to fear. Fixed by HADOOP-13786 & a bit of glue to keep Spark happy.

Re: Leveraging S3 select

Posted by Andrew Duffy <ad...@palantir.com>.
Hey Steve,

Happen to have a link to the TPC-DS benchmark data w/random S3 reads? I've done a decent amount of digging, but all I've found is a reference in a slide deck and some jira tickets.

From: Steve Loughran <st...@hortonworks.com>
Date: Tuesday, December 5, 2017 at 09:44
To: "Lalwani, Jayesh" <Ja...@capitalone.com>
Cc: Apache Spark Dev <de...@spark.apache.org>
Subject: Re: Leveraging S3 select





Re: Leveraging S3 select

Posted by Steve Loughran <st...@hortonworks.com>.

On 29 Nov 2017, at 21:45, Lalwani, Jayesh <Ja...@capitalone.com> wrote:

AWS announced at re:Invent that they are launching S3 Select. This can allow Spark to push down predicates to S3, rather than read the entire file in memory. Are there any plans to update Spark to use S3 Select?


  1.  ORC and Parquet don't read the whole file in memory anyway, except in the special case that the file is gzipped
  2.  Hadoop's s3a <= 2.7 doesn't handle the aggressive seeks of those columnar formats that well, as it does a GET from the current position to EOF & has to abort the TCP connection if the seek is backwards
  3.  Hadoop 2.8+ with spark.hadoop.fs.s3a.experimental.fadvise=random switches to random IO and only does smaller GET reads of the data requested (actually min(min-read-length, buffer-size)). This delivers a ~3x performance boost in TPC-DS benchmarks (minimal config sketch below)
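A minimal PySpark sketch of point 3, assuming a made-up bucket and dataset; the fadvise property is the real S3A option, everything else is illustrative.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3a-random-io-example")
         # Hadoop 2.8+ S3A option: small ranged GETs instead of one long
         # sequential GET per file; good for ORC/Parquet, bad for CSV/gzip
         .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
         .getOrCreate())

# hypothetical columnar dataset; the column-chunk seeks benefit from random IO
df = spark.read.orc("s3a://example-bucket/warehouse/events/")
print(df.filter("event_type = 'click'").count())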


I don't yet know how much more efficient the new mechanism will be against columnar data, given those facts. You'd need to do experiments.

The place to implement this would be through predicate push down from the file format to the FS. ORC & Parquet support predicate pushdown, so they'd need to recognise when the underlying store could do some of the work for them, open the store input stream differently, and use a whole new (undefined?) API to pass the queries down. Most likely: s3a would add a way to specify a predicate to select on in open(), as well as the expected file type. This would need the underlying mechanism to also support those formats, though, which the announcement doesn't.

Someone could do something more immediately through some modified CSV data source which did the pushdown. However, if you are using CSV for your datasets, there's something fundamental w.r.t. your data storage policy you need to look at. It works sometimes as an exchange format, though I prefer Avro there due to its schemas and support for more complex structures. As a format you run queries over? No.
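For context on what such a pushdown would call under the hood, here is a hedged sketch using the AWS SDK (boto3) S3 Select call directly; the bucket, key, columns and predicate are all made up for illustration, and this is not something Spark or s3a does today.

import boto3

s3 = boto3.client("s3")

# the predicate is evaluated inside S3, so only the matching rows are returned
resp = s3.select_object_content(
    Bucket="example-bucket",
    Key="data/people.csv",
    ExpressionType="SQL",
    Expression="SELECT s.name, s.age FROM S3Object s WHERE s.country = 'US'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# the response is an event stream: Records events carry the selected bytes
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")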