Posted to user@spark.apache.org by Wei Chen <we...@gmail.com> on 2016/02/02 19:07:42 UTC

optimal way to load parquet files with partition

Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
to get two months of data from a given year (let's say June and July)?

Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)

2. use filter after loading the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')

Which of the above is better? Or are there better ways to handle this?


Thank you,
Wei

Re: optimal way to load parquet files with partition

Posted by Michael Armbrust <mi...@databricks.com>.
It depends on how many partitions you have and whether you are doing only a
single operation.  Loading all the data and filtering requires Spark to scan
the directories to discover all the months.  This information is cached, after
which partition pruning should avoid reading the unneeded data.

Option 1 does not require this discovery scan, but is more work for the
developer.
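A third variant worth noting, assuming Spark 1.6+ and a SQLContext named `sqc` as in the original snippets: pass both partition directories to a single `parquet()` call, with the `basePath` option so Spark keeps `year` and `month` as partition columns. This avoids both the `unionAll` and the full-year directory scan. A minimal sketch (the read itself is commented out since it needs a live Spark context and the data under `xxx/`):

```python
# Build the list of partition directories to read; only these two
# paths are touched, so no full-year directory discovery is needed.
base = 'xxx/year=2015'
months = [6, 7]
paths = ['{0}/month={1}'.format(base, m) for m in months]

# With a live SQLContext `sqc`, a single read replaces the unionAll;
# `basePath` tells Spark where the partitioning starts, so the
# partition columns survive in the resulting DataFrame:
# df = sqc.read.option('basePath', base).parquet(*paths)
```

Whether this beats option 2 depends on the same trade-off Michael describes: it skips directory discovery for the untouched months at the cost of constructing the paths yourself.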
