Posted to user@orc.apache.org by Edmon Begoli <eb...@gmail.com> on 2016/04/15 03:50:03 UTC

ORC's slow(er) performance with MLlib

Hi,

We are running some experiments with Spark and ORC, Parquet and plain CSV
files and we are observing some interesting effects.

The dataset we are initially looking into is smallish (~100 MB as CSV), and
we encode it into Parquet and ORC.

When we run Spark SQL aggregate queries we see a substantial performance
speedup: close to 10x at best, and consistently 2-4x.
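
Concretely, the comparison looks roughly like this (a minimal sketch: the
dataset, columns, and paths are made up, and sqlContext is assumed to be a
Spark 1.6-era HiveContext, which ORC support requires):

    // Load the same hypothetical data in all three formats.
    val csv = sqlContext.read
      .format("com.databricks.spark.csv")  // spark-csv package, pre-Spark 2.0
      .option("header", "true")
      .option("inferSchema", "true")
      .load("data/events.csv")
    val parquet = sqlContext.read.parquet("data/events.parquet")
    val orc     = sqlContext.read.orc("data/events.orc")

    // Run the same aggregate over each format; the columnar formats only
    // need to read the two referenced columns, while CSV must parse every
    // row in full.
    for ((name, df) <- Seq(("csv", csv), ("parquet", parquet), ("orc", orc))) {
      df.registerTempTable(name)
      val t0 = System.nanoTime()
      sqlContext.sql(
        s"SELECT category, AVG(amount), MAX(amount) FROM $name GROUP BY category"
      ).collect()
      println(f"$name: ${(System.nanoTime() - t0) / 1e9}%.1f s")
    }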

General count queries are slower.

When we run MLlib random forests, we get a very unusual performance result.

The CSV run takes about 40 seconds, Parquet about 25, and ORC about 60.
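
The MLlib run is essentially the following (again a sketch: the RDD-based
MLlib API current in Spark 1.6 is assumed, and the label/feature layout is
made up):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest

    // Every row is fully decoded into a LabeledPoint, so the column
    // pruning and row skipping that help aggregates buy nothing here.
    val points = sqlContext.read.orc("data/events.orc").rdd.map { row =>
      LabeledPoint(row.getDouble(0),
        Vectors.dense((1 until row.length).map(row.getDouble).toArray))
    }

    val model = RandomForest.trainClassifier(points,
      numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 50, featureSubsetStrategy = "auto",
      impurity = "gini", maxDepth = 5, maxBins = 32)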

I have a few intuitions for why performance on the aggregate queries is so
good (sub-indexing and internal row/column group statistics), but I am not
quite clear on the random forest numbers.

Is ORC's decoding or data retrieval inefficient for these kinds of ML jobs?

This is for a performance study, so any insight would be highly
appreciated.

Re: ORC's slow(er) performance with MLlib

Posted by István <le...@gmail.com>.
Hi Edmon,

First and foremost, would you mind explaining the exact setup you have in
your infrastructure? Performance is a very complex subject with many moving
parts that you might not be aware of, though you can think of it as the sum
of its parts. One problem is that you are comparing different systems that
use different code paths for execution. Anyway, let's assume that everything
is the same and the only difference is indeed the file format you picked.

ORC has a footer[1] and also an index[2] to speed up the kinds of queries
you mentioned. This might be one reason for the performance characteristics
you see.

If you let me know about your setup I could re-create your test here, but I
would advise you to greatly increase the test data size. In my experience
ORC shines when you have a huge amount of data, in the range of a few
terabytes to petabytes, with high repetition (user IDs, hashes, etc.) within
the stripes. You can obviously use it for other things; it just might not be
worth it. I have very limited knowledge about Parquet; hopefully somebody
can chime in and add some context about that.

1.

The file footer contains a list of stripes in the file, the number of rows
per stripe, and each column's data type. It also contains column-level
aggregates count, min, max, and sum.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-FileStructure

2.

Furthermore, ORC files include light weight indexes that
include the minimum and maximum values for each column in each set of
10,000 rows and the entire file. Using pushdown filters from Hive, the
file reader can skip entire sets of rows that aren't important for
this query.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-Introduction
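
On the Spark side it is also worth checking that this pushdown actually
fires; a minimal sketch (assuming Spark 1.x, where ORC filter pushdown is
off by default, with a hypothetical file and predicate):

    // Without this flag the min/max indexes described above are never
    // consulted by the Spark ORC reader.
    sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

    sqlContext.read.orc("data/events.orc")
      .filter("amount > 1000")  // a predicate that lets ORC skip whole stripes
      .count()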

Best regards,
Istvan

-- 
*Istvan Szukacs*
CTO

+31647081521
istvan@streambrightdata.com
https://www.streambrightdata.com/


-- 
the sun shines for all

Re: ORC's slow(er) performance with MLlib

Posted by Edmon Begoli <eb...@gmail.com>.
Owen,

I am on travel, but I will try to send that over as soon as I am back.

It should not be a problem.

Edmon

On Mon, Apr 18, 2016 at 1:12 PM, Owen O'Malley <om...@apache.org> wrote:

> Edmon,
>    I'd love to help figure out what is going on. A couple of questions:
>
> * What file system are you reading from? HDFS? one of the S3-based ones?
> local?
> * Would it be possible to send me (omalley@apache.org) the file's
> metadata from orcfiledump?
> * Do you know if MLlib is having the reader seek? At 100MB, it should just
> read the file into memory.
>
> Thanks,
>    Owen

Re: ORC's slow(er) performance with MLlib

Posted by Owen O'Malley <om...@apache.org>.
Edmon,
   I'd love to help figure out what is going on. A couple of questions:

* What file system are you reading from? HDFS? one of the S3-based ones?
local?
* Would it be possible to send me (omalley@apache.org) the file's metadata
from orcfiledump? (A sample invocation is sketched after this list.)
* Do you know if MLlib is having the reader seek? At 100MB, it should just
read the file into memory.
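
A minimal sketch of that dump step (assuming the Hive CLI is on the path;
the file location is hypothetical):

    # Prints the ORC footer: stripe layout, column encodings, and
    # per-column statistics (count, min, max, sum).
    hive --orcfiledump /path/to/data.orc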

Thanks,
   Owen
