Posted to user@orc.apache.org by David Rosenstrauch <da...@darose.net> on 2015/09/28 22:44:41 UTC

Reading ORC Files from S3

A colleague of mine posted to this list a few months ago about some 
difficulties we were experiencing reading from ORC files stored on 
Amazon S3.  What we were finding was that a set of ORC files that we 
built performed well on HDFS, but showed extremely poor performance when 
stored on S3.  I've been continuing my colleague's work, and have tried 
various and sundry fixes and tweaks to try to improve the performance, 
but so far to no avail.  I was hoping perhaps someone on the 
list here might be able to shed some light as to why we're having these 
problems and/or have some suggestions on how we might be able to work 
around them.


A bit more detail about our issues:

We have 2 datasets that we've built which are stored as ORC files.  The 
first set is a series of records, sorted by record ID.  The second set 
is an inverted index into the first set, where each record contains a 
search key value followed by a record ID.  (The second dataset is sorted 
by search key value.)  The first dataset contains ~4000 files, totaling 
500GB (i.e., ~120MB per file); the second also contains ~4000 files, but 
totaling nearly 2TB (~230MB per file).
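
For concreteness, a row in the index dataset has roughly this shape 
(field names changed for this post; our real schema has a few more 
columns):

    // Illustrative record class - referenced again in the write-path
    // sketch further down:
    public class IndexRecord {
        public String searchKey;   // the index files are sorted by this
        public long recordId;      // points back into the records dataset
    }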

What I'm finding is that queries against the first dataset (the records) 
complete in a fairly reasonable amount of time, but queries against the 
index dataset are taking a very long time.  This is completely contrary 
to what I would expect: the index dataset should be better able to 
exploit the efficiencies built into ORC's storage layout, and so should 
be faster to query.  (I.e., theoretically ORC should 
be able to skip reading large portions of the index files by jumping 
directly to the index records that match the supplied search criteria. 
(Or at least jumping to a stripe close to them.))  But this is proving 
not to be the case.
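
To make that expectation concrete, here's the kind of standalone read I 
have in mind - a sketch only, written from memory of the Hive 1.x-era 
ORC APIs (bucket, path, and column names are made up, and exact method 
signatures may differ by version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcFile;
    import org.apache.hadoop.hive.ql.io.orc.Reader;
    import org.apache.hadoop.hive.ql.io.orc.RecordReader;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgument;
    import org.apache.hadoop.hive.ql.io.sarg.SearchArgumentFactory;

    public class IndexProbe {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Reader reader = OrcFile.createReader(
                new Path("s3n://my-bucket/index/part-00000.orc"),
                OrcFile.readerOptions(conf));

            // Push "search_key = 'some-value'" into the reader, so stripes
            // and row groups whose min/max stats exclude it get skipped:
            SearchArgument sarg = SearchArgumentFactory.newBuilder()
                .startAnd().equals("search_key", "some-value").end().build();

            // Column-name slot 0 is ORC's root struct, if I recall right:
            RecordReader rows = reader.rowsOptions(new Reader.Options()
                .searchArgument(sarg,
                    new String[]{null, "search_key", "record_id"}));
            Object row = null;
            while (rows.hasNext()) {
                row = rows.next(row);  // only rows surviving the pushdown
            }
            rows.close();
        }
    }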


All of the ORC files are generated using a custom map/reduce job with 
OrcNewOutputFormat (using Hive 0.13.1 jars) and are being queried via 
Hive queries (using Hive 1.1.0).  The files are initially written to 
HDFS, and then pushed to S3 (using distcp).  But my queries are all 
being done directly against the files stored on S3.  (I.e., a Hive 
external table with a LOCATION pointing to S3.)
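
In case the write path matters, here's the skeleton of the generation 
job - heavily trimmed, with our class names swapped for the IndexRecord 
example above, and with the stripe-size key being the one I believe the 
0.13 writer picks up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat;
    import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BuildIndex {
        public static Job createJob(Configuration conf) throws Exception {
            // Stripe size is one of the knobs I've been varying (64MB-256MB):
            conf.setLong("hive.exec.orc.default.stripe.size", 128L * 1024 * 1024);
            Job job = Job.getInstance(conf, "build-orc-index");
            job.setOutputFormatClass(OrcNewOutputFormat.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Writable.class);
            FileOutputFormat.setOutputPath(job, new Path("hdfs:///staging/index"));
            return job;
        }

        // Reducer side: OrcSerde turns the IndexRecord POJO into the
        // Writable that OrcNewOutputFormat expects.
        private static final OrcSerde SERDE = new OrcSerde();
        private static final StructObjectInspector OI = (StructObjectInspector)
            ObjectInspectorFactory.getReflectionObjectInspector(
                IndexRecord.class,
                ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

        static Writable toOrcRow(IndexRecord record) {
            // Called from the reducer as:
            //   context.write(NullWritable.get(), toOrcRow(record));
            return SERDE.serialize(record, OI);
        }
    }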


I've tried various tweaks to the ORC file generation process - a larger 
number of smaller files, a smaller number of larger files, stripe sizes 
varying from 64MB to 256MB, etc.  But nothing seems to make any 
difference.  Queries against the index dataset take a very long time no 
matter what I try - as in 4x-5x longer than querying the records dataset.

One other thing that I'm finding particularly strange here is that 
enabling predicate pushdown seems to have no effect - and sometimes even 
makes things worse.  When I set "hive.optimize.index.filter=true" I can 
see from the Hadoop job logs that the predicate pushdown is taking 
effect, but it doesn't make the query run any faster when the data is 
held on S3.
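
For completeness, here's the shape of the harness I'm timing with - a 
sketch over HiveServer2 JDBC, with table and column names changed to 
match the examples above:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PushdownProbe {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hs2-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // External table over the S3-resident ORC files:
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS search_index "
                    + "(search_key STRING, record_id BIGINT) STORED AS ORC "
                    + "LOCATION 's3n://my-bucket/index/'");
                // The setting in question:
                stmt.execute("SET hive.optimize.index.filter=true");
                try (ResultSet rs = stmt.executeQuery(
                         "SELECT record_id FROM search_index "
                         + "WHERE search_key = 'some-value'")) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1));
                    }
                }
            }
        }
    }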

ORC isn't giving me much of a clue as to the cause of the delays either. 
When I look in the Hadoop job task logs, I see a message about the 
S3NativeFileSystem opening one of my ORC files ... and then 6-7 minutes 
pass before I see the next log message about Hive starting to process 
the records in the file.


One other thing I've noticed is that I don't seem to be the only one 
experiencing this issue.  Googling on this topic turned up a few other 
people with a similar problem, most notably the blog post at 
http://bitmusings.tumblr.com/post/56081787247/orc-files-in-hive where 
the author found the performance so bad that he switched from the S3 
native storage format to the S3 block storage format in order to work 
around these issues.


So .... anyone have any ideas as to what might be causing this issue 
and/or how to work around it?  Is ORC simply unable to work efficiently 
against data stored on S3n?  (I.e., due to network round-trips taking 
too long.)  Any help anyone could offer would be greatly appreciated! 
This is proving to be a blocker issue for my project, and if I can't 
find a solution I'm likely going to wind up having to scrap the idea of 
using ORC to store the index.

Thanks!

Best,

DR

Re: Reading ORC Files from S3

Posted by David Rosenstrauch <da...@darose.net>.
Great!  I'll follow up with you guys off-list.

DR

On 09/29/2015 02:00 AM, Gopal Vijayaraghavan wrote:
> Hi,
>
>> OK, well that was easy.  Figured out my issue and managed to get ORC
>> working over s3a.  And got a huge speed-up over s3n!  (On the order of
>> 10x!)
>
> Cool! S3n is rather old now, while the aws-sdk updates keep s3a moving.
>
>> So yeah, I'm game for testing some new code when/if you're feeling
>> motivated to work on this.  Feel free to email me off-list and we can
>> get into the details.
>
> +Rajesh - who's actively chasing down the ORC + S3 changes today.
>
> Your email came at an opportune moment, since Rajesh's ORC changes landed
> in the hive-2.0 branch today:
>
> https://github.com/apache/hive/commit/a4c43f0335b33a75d2e9f3dc53b3cd33f8f115cf
>
>
> Cheers,
> Gopal
>
>>
>> On 09/28/2015 10:43 PM, David Rosenstrauch wrote:
>>> Super helpful response - thanks so much!  At least I know I'm not crazy
>>> now!  (And shouldn't spend any more time on tweaks trying to get this to
>>> work on s3n.)
>>>
>>> Let me try to start testing this using out-of-the-box s3a protocol.  (I
>>> haven't been able to get that to work at all yet - keep getting "Unable
>>> to load AWS credentials from any provider in the chain" errors.)  Once
>>> I'm able to get that far I'd be up for trying to test some new code. (As
>>> long as it doesn't wind up taking too much time.)
>>>
>>> Will report back soon.
>>>
>>> Thanks again!
>>>
>>> DR
>>>
>>> On 09/28/2015 06:14 PM, Gopal Vijayaraghavan wrote:
>>>>> avail.  I was hoping perhaps someone on the list here might
>>>>> be able to shed some light as to why we're having these problems and/or
>>>>> have some suggestions on how we might be able to work around them.
>>>> ...
>>>>>    (I.e., theoretically ORC should be able to skip reading large portions
>>>>> of the index files by jumping directly to the index
>>>>> records that match the supplied search criteria. (Or at least jumping to
>>>>> a stripe close to them.))  But this is proving not to be the case.
>>>>
>>>> Not theoretically. ORC does that and that's the issue.
>>>>
>>>> S3n is badly broken for a columnar format & even S3A is missing a couple
>>>> of features which are essential to get read performance over HTTP.
>>>>
>>>> Here's one example - every seek() disconnects & re-establishes an SSL
>>>> connection to S3 (that fix is a ~2x perf increase for S3a).
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-12444
>>>>
>>>>
>>>> In another scenario we found that a readFully(colOffset, ..., colSize) will
>>>> open an unbounded reader in S3n instead of reading the fixed chunk off
>>>> HTTP.
>>>>
>>>> https://issues.apache.org/jira/browse/HADOOP-11867
>>>>
>>>>
>>>> The lack of this means that even the short-lived keep-alive gets turned
>>>> off by the S3 impl when doing a forward-seek read pattern, because each
>>>> read ends in a recv-buffer-dropping disconnect, not a completed request.
>>>>
>>>> The Amazon proprietary S3 drivers are not subject to these problems, so
>>>> they work with ORC very well. It's the open source S3 filesystem impls
>>>> which are holding us back.
>>>>
>>>>> Is ORC simply unable to work efficiently against data stored on S3n?
>>>>> (I.e., due to network round-trips taking too long.)
>>>>
>>>> S3n is unable to handle any columnar format efficiently - it fires an
>>>> HTTP GET for each seek, with the range marked till the end of the file.
>>>> Any format which requires forward seeks or bounded readers is going to
>>>> die via TCP window & round-trip thrashing.
>>>>
>>>>
>>>> I know what's needed for s3a to work well with columnar readers
>>>> (Parquet/ORC/RCFile included) and to future-proof it so that it will work
>>>> fine when HTTP/2 arrives.
>>>>
>>>> If you're interested in being a guinea pig for S3a fixes, this is
>>>> currently sitting on my back burner (I'm not a Hadoop committer) - the
>>>> FS fixes are about two weeks' worth of work for a single motivated dev.
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>
>>
>>
>


Re: Reading ORC Files from S3

Posted by Gopal Vijayaraghavan <go...@hortonworks.com>.
Hi,

>OK, well that was easy.  Figured out my issue and managed to get ORC
>working over s3a.  And got a huge speed-up over s3n!  (On the order of
>10x!)

Cool! S3n is rather old now, while the aws-sdk updates keep s3a moving.

>So yeah, I'm game for testing some new code when/if you're feeling
>motivated to work on this.  Feel free to email me off-list and we can
>get into the details.

+Rajesh - who's actively chasing down the ORC + S3 changes today.

Your email came at an opportune moment, since Rajesh's ORC changes landed
in the hive-2.0 branch today:

https://github.com/apache/hive/commit/a4c43f0335b33a75d2e9f3dc53b3cd33f8f115cf


Cheers,
Gopal

>
>On 09/28/2015 10:43 PM, David Rosenstrauch wrote:
>> Super helpful response - thanks so much!  At least I know I'm not crazy
>> now!  (And shouldn't spend any more time on tweaks trying to get this to
>> work on s3n.)
>>
>> Let me try to start testing this using out-of-the-box s3a protocol.  (I
>> haven't been able to get that to work at all yet - keep getting "Unable
>> to load AWS credentials from any provider in the chain" errors.)  Once
>> I'm able to get that far I'd be up for trying to test some new code. (As
>> long as it doesn't wind up taking too much time.)
>>
>> Will report back soon.
>>
>> Thanks again!
>>
>> DR
>>
>> On 09/28/2015 06:14 PM, Gopal Vijayaraghavan wrote:
>>>> avail.  I was hoping perhaps someone on the list here might
>>>> be able to shed some light as to why we're having these problems and/or
>>>> have some suggestions on how we might be able to work around them.
>>> ...
>>>>   (I.e., theoretically ORC should be able to skip reading large portions
>>>> of the index files by jumping directly to the index
>>>> records that match the supplied search criteria. (Or at least jumping to
>>>> a stripe close to them.))  But this is proving not to be the case.
>>>
>>> Not theoretically. ORC does that and that's the issue.
>>>
>>> S3n is badly broken for a columnar format & even S3A is missing a couple
>>> of features which are essential to get read performance over HTTP.
>>>
>>> Here's one example - every seek() disconnects & re-establishes an SSL
>>> connection to S3 (that fix is a ~2x perf increase for S3a).
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-12444
>>>
>>>
>>> In another scenario we found that a readFully(colOffset, ..., colSize) will
>>> open an unbounded reader in S3n instead of reading the fixed chunk off
>>> HTTP.
>>>
>>> https://issues.apache.org/jira/browse/HADOOP-11867
>>>
>>>
>>> The lack of this means that even the short-lived keep-alive gets turned
>>> off by the S3 impl when doing a forward-seek read pattern, because each
>>> read ends in a recv-buffer-dropping disconnect, not a completed request.
>>>
>>> The Amazon proprietary S3 drivers are not subject to these problems, so
>>> they work with ORC very well. It's the open source S3 filesystem impls
>>> which are holding us back.
>>>
>>>> Is ORC simply unable to work efficiently against data stored on S3n?
>>>> (I.e., due to network round-trips taking too long.)
>>>
>>> S3n is unable to handle any columnar format efficiently - it fires an
>>> HTTP GET for each seek, with the range marked till the end of the file.
>>> Any format which requires forward seeks or bounded readers is going to
>>> die via TCP window & round-trip thrashing.
>>>
>>>
>>> I know what's needed for s3a to work well with columnar readers
>>> (Parquet/ORC/RCFile included) and to future-proof it so that it will work
>>> fine when HTTP/2 arrives.
>>>
>>> If you're interested in being a guinea pig for S3a fixes, this is
>>> currently sitting on my back burner (I'm not a Hadoop committer) - the
>>> FS fixes are about two weeks' worth of work for a single motivated dev.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>
>
>


Re: Reading ORC Files from S3

Posted by David Rosenstrauch <da...@darose.net>.
OK, well that was easy.  Figured out my issue and managed to get ORC 
working over s3a.  And got a huge speed-up over s3n!  (On the order of 10x!)
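
For anyone who later hits the "Unable to load AWS credentials from any 
provider in the chain" error I mentioned: s3a wants its own property 
names and won't read the fs.s3n.* ones.  Something along these lines 
(key names per the Hadoop version we're running; values redacted):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3aCredCheck {
        public static void main(String[] args) throws Exception {
            // In core-site.xml, or set programmatically before opening
            // the filesystem.  The fs.s3n.* key names are NOT picked up:
            Configuration conf = new Configuration();
            conf.set("fs.s3a.access.key", "<access-key>");
            conf.set("fs.s3a.secret.key", "<secret-key>");
            FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);
            System.out.println(fs.getFileStatus(new Path("/")));
        }
    }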

So yeah, I'm game for testing some new code when/if you're feeling 
motivated to work on this.  Feel free to email me off-list and we can 
get into the details.

Best,

DR

On 09/28/2015 10:43 PM, David Rosenstrauch wrote:
> Super helpful response - thanks so much!  At least I know I'm not crazy
> now!  (And shouldn't spend any more time on tweaks trying to get this to
> work on s3n.)
>
> Let me try to start testing this using out-of-the-box s3a protocol.  (I
> haven't been able to get that to work at all yet - keep getting "Unable
> to load AWS credentials from any provider in the chain" errors.)  Once
> I'm able to get that far I'd be up for trying to test some new code. (As
> long as it doesn't wind up taking too much time.)
>
> Will report back soon.
>
> Thanks again!
>
> DR
>
> On 09/28/2015 06:14 PM, Gopal Vijayaraghavan wrote:
>>> avail.  I was hoping perhaps someone on the list here might
>>> be able to shed some light as to why we're having these problems and/or
>>> have some suggestions on how we might be able to work around them.
>> ...
>>>   (I.e., theoretically ORC should be able to skip reading large portions
>>> of the index files by jumping directly to the index
>>> records that match the supplied search criteria. (Or at least jumping to
>>> a stripe close to them.))  But this is proving not to be the case.
>>
>> Not theoretically. ORC does that and that's the issue.
>>
>> S3n is badly broken for a columnar format & even S3A is missing a couple
>> of features which are essential to get read performance over HTTP.
>>
>> Here's one example - every seek() disconnects & re-establishes an SSL
>> connection to S3 (that fix is a ~2x perf increase for S3a).
>>
>> https://issues.apache.org/jira/browse/HADOOP-12444
>>
>>
>> In another scenario we found that a readFully(colOffset, ..., colSize) will
>> open an unbounded reader in S3n instead of reading the fixed chunk off
>> HTTP.
>>
>> https://issues.apache.org/jira/browse/HADOOP-11867
>>
>>
>> The lack of this means that even the short-lived keep-alive gets turned
>> off by the S3 impl when doing a forward-seek read pattern, because each
>> read ends in a recv-buffer-dropping disconnect, not a completed request.
>>
>> The Amazon proprietary S3 drivers are not subject to these problems, so
>> they work with ORC very well. It's the open source S3 filesystem impls
>> which are holding us back.
>>
>>> Is ORC simply unable to work efficiently against data stored on S3n?
>>> (I.e., due to network round-trips taking too long.)
>>
>> S3n is unable to handle any columnar format efficiently - it fires an
>> HTTP GET for each seek, with the range marked till the end of the file.
>> Any format which requires forward seeks or bounded readers is going to
>> die via TCP window & round-trip thrashing.
>>
>>
>> I know what's needed for s3a to work well with columnar readers
>> (Parquet/ORC/RCFile included) and to future-proof it so that it will work
>> fine when HTTP/2 arrives.
>>
>> If you're interested in being a guinea pig for S3a fixes, this is
>> currently sitting on my back burner (I'm not a Hadoop committer) - the
>> FS fixes are about two weeks' worth of work for a single motivated dev.
>>
>> Cheers,
>> Gopal
>>
>>
>


Re: Reading ORC Files from S3

Posted by David Rosenstrauch <da...@darose.net>.
Super helpful response - thanks so much!  At least I know I'm not crazy 
now!  (And shouldn't spend any more time on tweaks trying to get this to 
work on s3n.)

Let me try to start testing this using out-of-the-box s3a protocol.  (I 
haven't been able to get that to work at all yet - keep getting "Unable 
to load AWS credentials from any provider in the chain" errors.)  Once 
I'm able to get that far I'd be up for trying to test some new code. 
(As long as it doesn't wind up taking too much time.)

Will report back soon.

Thanks again!

DR

On 09/28/2015 06:14 PM, Gopal Vijayaraghavan wrote:
>> avail.  I was hoping perhaps someone on the list here might
>> be able to shed some light as to why we're having these problems and/or
>> have some suggestions on how we might be able to work around them.
> ...
>>   (I.e., theoretically ORC should be able to skip reading large portions
>> of the index files by jumping directly to the index
>> records that match the supplied search criteria. (Or at least jumping to
>> a stripe close to them.))  But this is proving not to be the case.
>
> Not theoretically. ORC does that and that's the issue.
>
> S3n is badly broken for a columnar format & even S3A is missing a couple
> of features which are essential to get read performance over HTTP.
>
> Here's one example - every seek() disconnects & re-establishes an SSL
> connection to S3 (that fix is a ~2x perf increase for S3a).
>
> https://issues.apache.org/jira/browse/HADOOP-12444
>
>
> In another scenario we found that a readFully(colOffset, ..., colSize) will
> open an unbounded reader in S3n instead of reading the fixed chunk off
> HTTP.
>
> https://issues.apache.org/jira/browse/HADOOP-11867
>
>
> The lack of this means that even the short-lived keep-alive gets turned off
> by the S3 impl when doing a forward-seek read pattern, because each read
> ends in a recv-buffer-dropping disconnect, not a completed request.
>
> The Amazon proprietary S3 drivers are not subject to these problems, so
> they work with ORC very well. It's the open source S3 filesystem impls
> which are holding us back.
>
>> Is ORC simply unable to work efficiently against data stored on S3n?
>> (I.e., due to network round-trips taking too long.)
>
> S3n is unable to handle any columnar format efficiently - it fires an HTTP
> GET for each seek, with the range marked till the end of the file. Any
> format which requires forward seeks or bounded readers is going to die via
> TCP window & round-trip thrashing.
>
>
> I know what's needed for s3a to work well with columnar readers
> (Parquet/ORC/RCFile included) and to future-proof it so that it will work
> fine when HTTP/2 arrives.
>
> If you're interested in being a guinea pig for S3a fixes, this is currently
> sitting on my back burner (I'm not a Hadoop committer) - the FS fixes are
> about two weeks' worth of work for a single motivated dev.
>
> Cheers,
> Gopal
>
>


Re: Reading ORC Files from S3

Posted by Gopal Vijayaraghavan <go...@apache.org>.
> avail.  I was hoping perhaps someone on the list here might
> be able to shed some light as to why we're having these problems and/or
> have some suggestions on how we might be able to work around them.
...
>  (I.e., theoretically ORC should be able to skip reading large portions
> of the index files by jumping directly to the index
> records that match the supplied search criteria. (Or at least jumping to
> a stripe close to them.))  But this is proving not to be the case.

Not theoretically. ORC does that and that's the issue.

S3n is badly broken for a columnar format & even S3A is missing a couple
of features which are essential to get read performance over HTTP.

Here's one example - every seek() disconnects & re-establishes an SSL
connection to S3 (that fix is a ~2x perf increase for S3a).

https://issues.apache.org/jira/browse/HADOOP-12444


In another scenario we found that a readFully(colOffset, ..., colSize) will
open an unbounded reader in S3n instead of reading the fixed chunk off
HTTP.

https://issues.apache.org/jira/browse/HADOOP-11867
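
To make the pattern concrete, this is schematically what a columnar 
reader does per column chunk (illustrative code, not the actual ORC 
reader; offsets made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadPattern {
        public static void main(String[] args) throws Exception {
            Path orc = new Path("s3n://bucket/index/part-00000.orc");
            FileSystem fs = orc.getFileSystem(new Configuration());
            long[][] chunks = { {3L, 1 << 16}, {4000000L, 1 << 18} };
            try (FSDataInputStream in = fs.open(orc)) {
                for (long[] c : chunks) {        // {offset, length} pairs
                    byte[] buf = new byte[(int) c[1]];
                    // A positioned, bounded read - on a sane FS this is one
                    // ranged HTTP GET.  On s3n it becomes an open-ended GET
                    // from c[0] to EOF, torn down once buf fills up - hence
                    // the round-trip thrashing.
                    in.readFully(c[0], buf, 0, buf.length);
                }
            }
        }
    }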


The lack of this means that even the short-lived keep-alive gets turned off
by the S3 impl when doing a forward-seek read pattern, because each read
ends in a recv-buffer-dropping disconnect, not a completed request.

The Amazon proprietary S3 drivers are not subject to these problems, so
they work with ORC very well. It's the open source S3 filesystem impls
which are holding us back.

> Is ORC simply unable to work efficiently against data stored on S3n?
> (I.e., due to network round-trips taking too long.)

S3n is unable to handle any columnar format efficiently - it fires an HTTP
GET for each seek, with the range marked till the end of the file. Any
format which requires forward seeks or bounded readers is going to die via
TCP window & round-trip thrashing.


I know what's needed for s3a to work well with columnar readers
(Parquet/ORC/RCFile included) and to future-proof it so that it will work
fine when HTTP/2 arrives.

If you're interested in being a guinea pig for S3a fixes, this is currently
sitting on my back burner (I'm not a Hadoop committer) - the FS fixes are
about two weeks' worth of work for a single motivated dev.

Cheers,
Gopal