Posted to user@spark.apache.org by Steve Lindemann <sr...@gmail.com> on 2015/07/05 23:47:26 UTC

efficiently accessing partition data for datasets in S3 with SparkSQL

I'm trying to use SparkSQL to efficiently query structured data from
datasets in S3. The data is naturally partitioned by date, so I've laid it
out in S3 as follows:

s3://bucket/dataset/dt=2015-07-05/
s3://bucket/dataset/dt=2015-07-04/
s3://bucket/dataset/dt=2015-07-03/
etc.

In each directory, data for the given date would be written as Parquet
(although that's not a hard requirement if there's a more efficient way to
do it). I have data going back a couple of decades, so we're looking at
thousands of partitions. In each partition, the day's data is anywhere from
100 MB to 5 GB.
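
In case it helps, the write side looks roughly like this (a simplified
sketch; the schema, values, and paths other than the dt= layout are
placeholders):

    // Sketch of how a single day's data gets written (spark-shell, Spark 1.4-style
    // DataFrame API). The Tick schema and its values are made-up placeholders.
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    case class Tick(symbol: String, price: Double)   // placeholder schema
    val day = Seq(Tick("ABC", 101.25), Tick("XYZ", 56.50)).toDF()

    // Each day is written as Parquet directly under its dt= directory.
    day.write.parquet("s3://bucket/dataset/dt=2015-07-05/")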

Most queries only use data from one to five partitions (i.e., they filter
very tightly on the partition column). Ideally, those queries would run
quickly, skipping irrelevant partitions entirely and touching only the data
in the partitions they need. However, early tests seem to show Spark reading
Parquet metadata for all files in the dataset, even files in partitions
excluded by a predicate filter (and pruned by DataSourceStrategy).
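
For reference, the queries themselves look roughly like this (again
simplified; the column names other than dt are placeholders):

    // The dt filter should restrict the scan to a few partitions, but the Parquet
    // footers for every file in the dataset still seem to be read.
    val dataset = sqlContext.read.parquet("s3://bucket/dataset/")

    val recent = dataset
      .filter(dataset("dt") >= "2015-07-03")
      .select("dt", "symbol", "price")   // placeholder columns

    recent.show()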

The result is that a query which should take a second or two instead spends a
minute or more reading metadata for irrelevant partitions before extracting
the data of interest. This greatly limits my ability to use the system, since
a very common use case is to run a script, download query results into a
file, and return; that now takes about a minute instead of 2-3 seconds.

Am I missing something here? This feels like something that should be easy,
but I'm not finding it to be. Can anyone suggest why this might be happening?
Am I approaching the problem the wrong way?

Thanks for your help.