Posted to user@spark.apache.org by Sushrut Ikhar <su...@gmail.com> on 2018/03/01 20:45:26 UTC

Re: parquet vs orc files

To add, schema evolution is better supported in Parquet than in ORC (at the
cost of being a bit slower), as ORC is truly index based;
this is especially useful if you may want to drop a column later.
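To illustrate the idea (this is a minimal pure-Python sketch of schema merging, not Parquet's actual on-disk mechanics): a reader takes the union of two files' column sets and fills the gaps with nulls, so files written before and after a column change can still be read together. In Spark, Parquet schema merging is enabled with the `mergeSchema` read option.

```python
# Sketch: merging two "file" schemas the way a Parquet reader with
# schema merging can - take the union of columns and fill missing
# values with None. Simulates the idea, not the actual format.

def merge_read(files):
    # Union of all column names, preserving first-seen order.
    columns = []
    for f in files:
        for col in f["schema"]:
            if col not in columns:
                columns.append(col)
    # Re-emit every row against the merged schema.
    rows = []
    for f in files:
        for row in f["rows"]:
            rows.append({c: row.get(c) for c in columns})
    return columns, rows

# Two "files" written under different schema versions:
old = {"schema": ["id", "name"], "rows": [{"id": 1, "name": "a"}]}
new = {"schema": ["id", "city"], "rows": [{"id": 2, "city": "x"}]}

cols, rows = merge_read([old, new])
print(cols)  # ['id', 'name', 'city']
print(rows)  # missing columns come back as None
```

In Spark itself the equivalent read would be something like `spark.read.option("mergeSchema", "true").parquet(path)`.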

Regards,

Sushrut Ikhar
<https://about.me/sushrutikhar?promo=email_sig>


On Fri, Feb 23, 2018 at 1:10 AM, Jörn Franke <jo...@gmail.com> wrote:

> Look at the documentation of the formats. In any case:
> * use additionally partitions on the filesystem
> * sort the data on the filter columns - otherwise you do not benefit from
> min/max indexes and bloom filters
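The sorting advice above can be sketched in a few lines of pure Python (this simulates row-group min/max statistics, not the actual file formats): each group of rows keeps its min and max, and a reader skips any group whose range cannot contain the lookup value. With sorted data the value lands in few groups; with unsorted data almost every group's range covers it.

```python
import random

# Split values into fixed-size "row groups", each carrying min/max stats.
def row_groups(values, size):
    groups = [values[i:i + size] for i in range(0, len(values), size)]
    return [(min(g), max(g), g) for g in groups]

# Count how many groups a point lookup must actually scan:
# a group can be skipped when target < min or target > max.
def groups_scanned(groups, target):
    return sum(1 for lo, hi, _ in groups if lo <= target <= hi)

data = list(range(100))
random.seed(0)
shuffled = data[:]
random.shuffle(shuffled)

sorted_groups = row_groups(sorted(data), 10)
shuffled_groups = row_groups(shuffled, 10)

print(groups_scanned(sorted_groups, 42))    # 1 - only one group's range matches
print(groups_scanned(shuffled_groups, 42))  # typically most of the 10 groups
```

The same reasoning is why sorting on the filter columns before writing Parquet/ORC pays off: the writer records per-group statistics either way, but only sorted data makes those ranges narrow enough to skip anything.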
>
>
>
> On 21. Feb 2018, at 22:58, Kane Kim <ka...@gmail.com> wrote:
>
> Thanks, how do min/max indexes work? Can Spark itself configure bloom
> filters when saving as ORC?
>
> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> In the latest version both are equally well supported.
>>
>> You need to insert the data sorted on the filtering columns.
>> Then you will benefit from min/max indexes and, in the case of ORC,
>> additionally from bloom filters, if you configure them.
>> In any case I also recommend partitioning the files (do not confuse this
>> with Spark partitioning).
>>
>> Which is best for you, you have to figure out with a test. This highly
>> depends on the data and the analysis you want to do.
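The bloom-filter idea mentioned above can be sketched in pure Python (an illustration only, not ORC's actual implementation; in Spark, ORC bloom filters can typically be requested at write time via the `orc.bloom.filter.columns` writer option, but check the documentation for your ORC/Spark versions): each stripe keeps a small bit array, and a "definitely absent" answer lets the reader skip the stripe without decoding it.

```python
import hashlib

class Bloom:
    """Toy bloom filter: k hash positions set in a fixed-size bit array."""

    def __init__(self, bits=256, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = 0  # bit array packed into one int

    def _positions(self, key):
        # Derive k positions by hashing the key with a per-hash salt.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present"
        # (bloom filters can give rare false positives, never negatives).
        return all(self.array >> p & 1 for p in self._positions(key))

# One filter per "stripe", built from the keys stored in that stripe:
stripe1 = Bloom()
for k in ("alice", "bob"):
    stripe1.add(k)
stripe2 = Bloom()
for k in ("carol", "dave"):
    stripe2.add(k)

# A point lookup for "carol" can almost certainly skip stripe1 entirely:
print(stripe1.might_contain("carol"))  # almost certainly False -> skip stripe
print(stripe2.might_contain("carol"))  # True -> read the stripe
```

This is also why the sorting advice matters for bloom filters less than for min/max indexes: bloom filters answer point lookups on unsorted data too, but they only help when the query is an equality predicate on the configured column.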
>>
>> > On 21. Feb 2018, at 21:54, Kane Kim <ka...@gmail.com> wrote:
>> >
>> > Hello,
>> >
>> > Which format is better supported in Spark, Parquet or ORC?
>> > Will Spark use the internal sorting of Parquet/ORC files (and how can I
>> > test that)?
>> > Can Spark save sorted Parquet/ORC files?
>> >
>> > Thanks!
>>
>
>