You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by Quanlong Huang <hu...@126.com> on 2018/01/26 11:46:22 UTC

[Ready for Review] IMPALA-5717: Support reading from ORC format files

Hi friends,


I'm very excited that our ORC-support patch has passed all the tests and is ready for review! To ease your work, we wrote a brief document about our solution: https://docs.google.com/document/d/1Lg-MmZIis-ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc


In short, we integrated the mature orc-reader (https://github.com/apache/orc/tree/master/c%2B%2B) into Impala. As a first step, we only support reading primitive types.


Finally, here is the review link: https://gerrit.cloudera.org/#/c/9134/


Hope this feature can be accepted! Any feedback is welcome!


Thanks,
Quanlong Huang
Hulu

Re:Re: [Ready for Review] IMPALA-5717: Support reading from ORC format files

Posted by Quanlong Huang <hu...@126.com>.
Hi Tim,>* Assuming we end up using the ORC C++ library, we probably want to manage
>it in the same way that we do Avro by building it externally and then
>linking against it (we use the native-toolchain project for convenience).
>Importing the code seems OK at least for the initial review though.Linking the ORC library outside looks more elegant. However, I think it might loss some oppotunitis for further optimization. Looks like every system has it's own orc reader actually. For example, Presto has a module for ORC: https://github.com/prestodb/presto/tree/master/presto-orc>* It sucks that the ORC library can't handle allocation failures
>gracefully. I need to think about it more. One option is to contribute
>fixes back to the ORC project.
The ORC library will only allocate memory throught orc::MemoryPool::malloc. This is the function we implemented. We can throw an exception inside this function for allocation failures to stop the reader, and catch the exception outside the reader to handle it. This is our current solution, to be able to not change any codes of the reader.
>* This is definitely going to conflict with
>https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't
>know how deterministic the ORC readers memory requirements are, but if it
>assumes it can allocate an arbitrary amount of memory, then it may be
>problematic.
We can track and controll the memory consumption in the implemetation of orc::MemoryPool. So I think this is safe.


Finally, since ORC is another columnar file format much like Parquet, we hope the orc-reader can be merged into Impala organically. That is, the parquet-scanner and orc-scanner can share many logics, instead of using totally indenpend code path. For example, the codes of run-length decoding can be shared with both scanners. We can also handle the allocation failures inside the orc-reader more elegantly.


This patch which merges ORC library codes inside Impala is just a prototype to prove that it can work correctly with modules in Impala. We can start from it to make it merged organically.


At 2018-01-27 01:02:03, "Tim Armstrong" <ta...@cloudera.com> wrote: >Thank you! > >I had few higher-level questions or thoughts: > >* Assuming we end up using the ORC C++ library, we probably want to manage >it in the same way that we do Avro by building it externally and then >linking against it (we use the native-toolchain project for convenience). >Importing the code seems OK at least for the initial review though. >* It sucks that the ORC library can't handle allocation failures >gracefully. I need to think about it more. One option is to contribute >fixes back to the ORC project. >* This is definitely going to conflict with >https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't >know how deterministic the ORC readers memory requirements are, but if it >assumes it can allocate an arbitrary amount of memory, then it may be >problematic. I'll have to think about this. I don't necessarily think you >should have to do the work to switch the ORC scanner to the new paradigm. >One would be to merge the ORC reader to a branch post-code-review and then >I could do the necessary work to switch it to the new paradigm (since I've >already done it for all the other scanners). > >On Fri, Jan 26, 2018 at 3:46 AM, Quanlong Huang <hu...@126.com> >wrote: > >> Hi friends, >> >> I'm very excited that our ORC-support patch has passed all the tests and >> is ready for review! To ease your work, we wrote a brief document about our >> solution: https://docs.google.com/document/d/1Lg-MmZIis- >> ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc >> >> In short, we integrated the mature orc-reader (https://github.com/apache/ >> orc/tree/master/c%2B%2B) into Impala. As a first step, we only support >> reading primitive types. >> >> Finally, here is the review link: https://gerrit.cloudera.org/#/c/9134/ >> >> Hope this feature can be accepted! Any feedback is welcome! >> >> Thanks, >> Quanlong Huang >> Hulu >> >> >> >>

Re: [Ready for Review] IMPALA-5717: Support reading from ORC format files

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Jan 26, 2018 at 12:02 PM, Tim Armstrong <ta...@cloudera.com>
wrote:

> Thank you!
>
> I had few higher-level questions or thoughts:
>
> * Assuming we end up using the ORC C++ library, we probably want to manage
> it in the same way that we do Avro by building it externally and then
> linking against it (we use the native-toolchain project for convenience).
> Importing the code seems OK at least for the initial review though.
> * It sucks that the ORC library can't handle allocation failures
> gracefully. I need to think about it more. One option is to contribute
> fixes back to the ORC project.
> * This is definitely going to conflict with
> https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I
> don't
> know how deterministic the ORC readers memory requirements are, but if it
> assumes it can allocate an arbitrary amount of memory, then it may be
> problematic. I'll have to think about this. I don't necessarily think you
> should have to do the work to switch the ORC scanner to the new paradigm.
> One would be to merge the ORC reader to a branch post-code-review and then
> I could do the necessary work to switch it to the new paradigm (since I've
> already done it for all the other scanners).
>
> On Fri, Jan 26, 2018 at 3:46 AM, Quanlong Huang <hu...@126.com>
> wrote:
>
> > Hi friends,
> >
> > I'm very excited that our ORC-support patch has passed all the tests and
> > is ready for review! To ease your work, we wrote a brief document about
> our
> > solution: https://docs.google.com/document/d/1Lg-MmZIis-
> > ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc
> >
> > In short, we integrated the mature orc-reader (
> https://github.com/apache/
> > orc/tree/master/c%2B%2B) into Impala. As a first step, we only support
> > reading primitive types.
> >
> > Finally, here is the review link: https://gerrit.cloudera.org/#/c/9134/
> >
> > Hope this feature can be accepted! Any feedback is welcome!
> >
> > Thanks,
> > Quanlong Huang
> > Hulu
> >
> >
> >
> >
>

This is really awesome and important work. This will help tear down huge
wall wedged between Hive and Impala. The wall helps no one and in the end
only drives users to seek out other solutions (aka cloud solutions like
BigQuery, Athena, Redshift)

Re: [Ready for Review] IMPALA-5717: Support reading from ORC format files

Posted by Tim Armstrong <ta...@cloudera.com>.
Thank you!

I had few higher-level questions or thoughts:

* Assuming we end up using the ORC C++ library, we probably want to manage
it in the same way that we do Avro by building it externally and then
linking against it (we use the native-toolchain project for convenience).
Importing the code seems OK at least for the initial review though.
* It sucks that the ORC library can't handle allocation failures
gracefully. I need to think about it more. One option is to contribute
fixes back to the ORC project.
* This is definitely going to conflict with
https://gerrit.cloudera.org/#/c/8966/ (use reservations for scans). I don't
know how deterministic the ORC readers memory requirements are, but if it
assumes it can allocate an arbitrary amount of memory, then it may be
problematic. I'll have to think about this. I don't necessarily think you
should have to do the work to switch the ORC scanner to the new paradigm.
One would be to merge the ORC reader to a branch post-code-review and then
I could do the necessary work to switch it to the new paradigm (since I've
already done it for all the other scanners).

On Fri, Jan 26, 2018 at 3:46 AM, Quanlong Huang <hu...@126.com>
wrote:

> Hi friends,
>
> I'm very excited that our ORC-support patch has passed all the tests and
> is ready for review! To ease your work, we wrote a brief document about our
> solution: https://docs.google.com/document/d/1Lg-MmZIis-
> ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc
>
> In short, we integrated the mature orc-reader (https://github.com/apache/
> orc/tree/master/c%2B%2B) into Impala. As a first step, we only support
> reading primitive types.
>
> Finally, here is the review link: https://gerrit.cloudera.org/#/c/9134/
>
> Hope this feature can be accepted! Any feedback is welcome!
>
> Thanks,
> Quanlong Huang
> Hulu
>
>
>
>