Posted to dev@orc.apache.org by Xiening Dai <xn...@live.com> on 2018/07/03 23:41:53 UTC

Arrow Support of Orc

Hi all,

Not sure if this has been brought up before - do we have a plan to support Apache Arrow? Given its recent popularity and momentum, we might consider supporting the Arrow format in the ORC reader and writer. There’s an adapter for the ORC C++ reader - https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc - but the implementation is inefficient. If we want to integrate better with Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow format.

Re: Re: Arrow Support of Orc

Posted by Deepak Majeti <ma...@gmail.com>.

Following up on Owen's question: do you have an estimate of the performance
gains from implementing native support?

Creating a new API for supporting Arrow is a good starting point. Can you
come up with a design document first?

On Mon, Jul 16, 2018 at 4:24 AM 周宇睿(闻拙) <yu...@alibaba-inc.com> wrote:

> Hi All:
>
> Currently Arrow provides a naive implementation for converting
> ColumnVectorBatch to Arrow’s RecordBatch, which involves a lot of overhead
> from memory copying and transcoding.
>
> We would like to add a native API set that lets users read data directly
> from an ORC file into Arrow’s RecordBatch. The new API set will be separate
> from the current ColumnVectorBatch API, so we won’t raise any backward
> compatibility issues.
>
> Creating a new API set is not an elegant solution and it requires more
> maintenance effort. But given Arrow’s current momentum and its benefits
> for sharing columnar data across various platforms and data formats, we
> believe it is worthwhile to enable Arrow support in ORC.
>
> Any advice would be appreciated.
> Thanks
> Yurui
>

-- 
regards,
Deepak Majeti

Re: Re: Arrow Support of Orc

Posted by "周宇睿(闻拙)" <yu...@alibaba-inc.com>.

Hi All:

Currently Arrow provides a naive implementation for converting ColumnVectorBatch to Arrow’s RecordBatch, which involves a lot of overhead from memory copying and transcoding.

We would like to add a native API set that lets users read data directly from an ORC file into Arrow’s RecordBatch. The new API set will be separate from the current ColumnVectorBatch API, so we won’t raise any backward compatibility issues.

Creating a new API set is not an elegant solution and it requires more maintenance effort. But given Arrow’s current momentum and its benefits for sharing columnar data across various platforms and data formats, we believe it is worthwhile to enable Arrow support in ORC.
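
To make the shape of that proposal a bit more concrete, here is a minimal C++ sketch. It is not ORC or Arrow code: every name in it (ToyRowReader, LegacyBatch, ArrowStyleBatch, next, nextArrow, decodeInto) is a hypothetical stand-in invented for illustration. It only assumes that the existing entry point keeps its signature and behaviour, and that a second, separate entry point fills an Arrow-style buffer through the same internal decode routine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for the existing batch type (not orc::ColumnVectorBatch).
struct LegacyBatch {
    std::vector<std::int64_t> data;
    std::size_t numElements = 0;
};

// Hypothetical stand-in for an Arrow-style column buffer (not arrow::RecordBatch).
struct ArrowStyleBatch {
    std::vector<std::int64_t> values;
};

class ToyRowReader {
public:
    explicit ToyRowReader(std::vector<std::int64_t> rows) : rows_(std::move(rows)) {}

    // Existing-style entry point: left exactly as callers already use it.
    bool next(LegacyBatch& batch, std::size_t maxRows) {
        batch.data.resize(maxRows);
        batch.numElements = decodeInto(batch.data.data(), maxRows);
        return batch.numElements > 0;
    }

    // New, separate entry point: fills the Arrow-style buffer directly,
    // without going through LegacyBatch first.
    bool nextArrow(ArrowStyleBatch& batch, std::size_t maxRows) {
        batch.values.resize(maxRows);
        const std::size_t n = decodeInto(batch.values.data(), maxRows);
        batch.values.resize(n);
        return n > 0;
    }

private:
    // Shared decode step used by both entry points (assumed design).
    std::size_t decodeInto(std::int64_t* out, std::size_t maxRows) {
        assert(pos_ <= rows_.size());
        const std::size_t n = std::min(maxRows, rows_.size() - pos_);
        std::copy_n(rows_.begin() + static_cast<std::ptrdiff_t>(pos_), n, out);
        pos_ += n;
        return n;
    }

    std::vector<std::int64_t> rows_;
    std::size_t pos_ = 0;
};
```

Because nextArrow is purely additive in this sketch, code that only calls next is unaffected. The cost is the one noted above: two public surfaces to maintain.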

Any advice would be appreciated.
Thanks
Yurui 

 ------------------Original Mail ------------------
Sender:Xiening Dai <xn...@live.com>
Send Date:Fri Jul 6 01:25:34 2018
Recipients:dev@orc.apache.org <de...@orc.apache.org>
Subject:Re: Arrow Support of Orc
I haven’t done any profiling. The major overhead I can see is the conversion from ColumnVectorBatch to Arrow’s RecordBatch, which involves memory copies and some transcoding. Also, the current adapter only supports reading an entire stripe as a batch, which in a lot of cases is not ideal. I agree that we should maintain backward compatibility. I am wondering whether we could expose another set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter classes.




Re: Arrow Support of Orc

Posted by Xiening Dai <xn...@live.com>.

I haven’t done any profiling. The major overhead I can see is the conversion from ColumnVectorBatch to Arrow’s RecordBatch, which involves memory copies and some transcoding. Also, the current adapter only supports reading an entire stripe as a batch, which in a lot of cases is not ideal. I agree that we should maintain backward compatibility. I am wondering whether we could expose another set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter classes.
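
A rough illustration of the two costs I mean, as a toy C++ sketch rather than measured behaviour of the real adapter. All names (ToyColumnDecoder, ToyArrowColumn, ReadStats, readViaIntermediate, readDirect) are hypothetical, and the decoder just fabricates values. The first path decodes a whole stripe into an intermediate buffer and then copies it; the second decodes bounded batches straight into the destination.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for a column decoder shared by both read paths.
struct ToyColumnDecoder {
    std::size_t stripeRows;

    // "Decodes" rows [start, start + count) into out; the payload is fake.
    void decode(std::size_t start, std::size_t count, std::int64_t* out) const {
        for (std::size_t i = 0; i < count; ++i) {
            out[i] = static_cast<std::int64_t>(start + i) * 2;
        }
    }
};

// Hypothetical stand-in for one column of an Arrow-style batch.
struct ToyArrowColumn {
    std::vector<std::int64_t> values;
};

struct ReadStats {
    std::size_t bytesCopied = 0;      // bytes moved by the extra conversion step
    std::size_t peakScratchRows = 0;  // rows held in an intermediate buffer
};

// Path 1: whole stripe into an intermediate buffer, then copy to the output.
std::vector<ToyArrowColumn> readViaIntermediate(const ToyColumnDecoder& dec,
                                                ReadStats& stats) {
    std::vector<std::int64_t> intermediate(dec.stripeRows);
    dec.decode(0, dec.stripeRows, intermediate.data());
    stats.peakScratchRows = intermediate.size();

    ToyArrowColumn col;
    col.values.resize(intermediate.size());
    std::copy(intermediate.begin(), intermediate.end(), col.values.begin());
    stats.bytesCopied += intermediate.size() * sizeof(std::int64_t);

    std::vector<ToyArrowColumn> batches;
    batches.push_back(std::move(col));  // one batch for the entire stripe
    return batches;
}

// Path 2: bounded batches decoded directly into the destination buffers.
std::vector<ToyArrowColumn> readDirect(const ToyColumnDecoder& dec,
                                       std::size_t batchRows,
                                       ReadStats& stats) {
    assert(batchRows > 0);
    std::vector<ToyArrowColumn> batches;
    for (std::size_t start = 0; start < dec.stripeRows; start += batchRows) {
        const std::size_t count = std::min(batchRows, dec.stripeRows - start);
        ToyArrowColumn col;
        col.values.resize(count);
        dec.decode(start, count, col.values.data());  // no intermediate copy
        batches.push_back(std::move(col));
    }
    (void)stats;  // nothing to record: no extra copy, no scratch buffer
    return batches;
}
```

In this toy, the first path copies every decoded byte once more and holds a full stripe in scratch memory, while the second does neither. Whether the real gap is significant would still need profiling.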

> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <ow...@gmail.com> wrote:
> 
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
> 
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
> 
> .. Owen
> 

Re: Arrow Support of Orc

Posted by Owen O'Malley <ow...@gmail.com>.

I think improved Arrow C++ integration would be great. I haven't looked at
the current state of the work to see what could be better. I'd be against
making Arrow the default C++ API, but changes to the API to make things
faster for Arrow make sense. (Although as always, we need to worry about
backwards compatibility.)

Have you tried benchmarking and profiling the current adapters to see where
the bottlenecks are?

.. Owen

On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xn...@live.com> wrote:

> Hi all,
>
> Not sure if this has been brought up before - do we have a plan to support
> Apache Arrow? Given its recent popularity and momentum, we might consider
> supporting the Arrow format in the ORC reader and writer. There’s an adapter
> for the ORC C++ reader -
> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc -
> but the implementation is inefficient. If we want to integrate better with
> Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow
> format.
>