Posted to dev@arrow.apache.org by Fan Liya <li...@gmail.com> on 2019/07/11 04:01:04 UTC

[Discuss] Support an alternative memory layout for varchar/varbinary vectors

Hi all,


We are thinking of providing varchar/varbinary vectors with a different
memory layout which exists in a wide range of systems. The memory layout is
different from that of VarCharVector in the following ways:


   1. Instead of storing (start offset, end offset), the new layout stores
      (start offset, length).
   2. The content of varchars may not be in a consecutive memory region.
      Instead, it can be at arbitrary memory addresses.
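
To make the two differences concrete, here is a minimal sketch in plain Java.
It does not use Arrow's classes: LayoutSketch and both method names are
hypothetical, and a segment index stands in for a raw memory address, which
plain Java cannot express directly.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; these are not Arrow classes. */
final class LayoutSketch {

    /** Offset-based layout: one contiguous data buffer plus n + 1 offsets. */
    static String readContiguous(int[] offsets, byte[] data, int i) {
        int start = offsets[i];
        int end = offsets[i + 1]; // the end of value i is the start of value i + 1
        return new String(data, start, end - start, StandardCharsets.UTF_8);
    }

    /** Pointer-style layout: value i lives in some segment at (start, length). */
    static String readScattered(byte[][] segments, int[] segmentIds,
                                int[] starts, int[] lengths, int i) {
        byte[] segment = segments[segmentIds[i]]; // stands in for a raw address
        return new String(segment, starts[i], lengths[i], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two values, "foo" and "hello", stored back to back.
        int[] offsets = {0, 3, 8};
        byte[] data = "foohello".getBytes(StandardCharsets.UTF_8);
        System.out.println(readContiguous(offsets, data, 1));

        // The same two values spread over two unrelated segments.
        byte[][] segments = {
            "xxhello".getBytes(StandardCharsets.UTF_8),
            "foo".getBytes(StandardCharsets.UTF_8)
        };
        System.out.println(readScattered(segments, new int[] {1, 0},
                new int[] {0, 2}, new int[] {3, 5}, 1));
    }
}
```

Reading a value costs about the same in both layouts. The difference shows up
when the whole column has to be handed over as a single buffer.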


Because of these differences in memory layout, converting data between
existing systems and VarCharVectors incurs performance overhead.

The above difference 1 seems insignificant, while difference 2 is difficult
to overcome. However, the scenario of difference 2 is prevalent in
practice: for example we store strings in a series of memory segments.
Whenever a segment is full, we request a new one. However, these memory
segments may not be consecutive, because other processes/threads are also
requesting/releasing memory segments in the meantime.

So we are wondering if it is possible to support such a memory layout in
Arrow. I think there are more systems that are trying to adopt Arrow
but are hindered by this difficulty.

Would you please give your valuable feedback?


Best,

Liya Fan

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Antoine Pitrou <an...@python.org>.
Same as Uwe.

Regards

Antoine.


Le 11/07/2019 à 14:05, Uwe L. Korn a écrit :
> Hello Liya,
> 
> I'm quite -1 on this type as Arrow is about efficient columnar structures. We have opened the standard also to matrix-like types but always keep the constraint of consecutive memory. Now also adding types where memory is no longer consecutive but spread in the heap will make the scope of the project much wider (It seems that we then just turn into a general serialization framework).
> 
> One of the ideas of a common standard is that some need to make compromises. I think in this case it is a necessary compromise to not allow all kinds of string representations.
> 
> Uwe

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Fan Liya <li...@gmail.com>.
@Wes McKinney,

Thanks a lot for your comments and effort.
The JIRA looks good. I will track it.

Best,
Liya Fan

On Fri, Jul 12, 2019 at 10:31 PM Wes McKinney <we...@gmail.com> wrote:

> hi Liya -- yes, it seems reasonable to defer the conversion from your
> pointer-based extension representation to a proper VarCharVector until
> you need to send over IPC.
>
> Note that there is no mechanism yet in Java with extension types to
> cause a conversion to take place when the IPC step is reached.
>
> I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try
> to explain this issue. Let me know if it is not clear.
>
> I'm interested to experiment with the same thing in C++. We would have
> an ExtensionArray in C++ whose values are string_view referencing
> external memory, for example.
>
> - Wes

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Wes McKinney <we...@gmail.com>.
hi Liya -- yes, it seems reasonable to defer the conversion from your
pointer-based extension representation to a proper VarCharVector until
you need to send over IPC.

Note that there is no mechanism yet in Java with extension types to
cause a conversion to take place when the IPC step is reached.

I just opened https://issues.apache.org/jira/browse/ARROW-5929 to try
to explain this issue. Let me know if it is not clear.

I'm interested to experiment with the same thing in C++. We would have
an ExtensionArray in C++ whose values are string_view referencing
external memory, for example.

- Wes

On Thu, Jul 11, 2019 at 10:16 PM Fan Liya <li...@gmail.com> wrote:
>
> @Wes McKinney,
>
> Thanks a lot for the brainstorming. I think your ideas are reasonable and
> feasible.
> About IPC, my idea is that we can send the vector as a PointerStringVector,
> and receive it as a VarCharVector, so that the overhead of memory
> compaction can be hidden.
> What do you think?
>
> Best,
> Liya Fan

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Fan Liya <li...@gmail.com>.
@Wes McKinney,

Thanks a lot for the brainstorming. I think your ideas are reasonable and
feasible.
About IPC, my idea is that we can send the vector as a PointerStringVector,
and receive it as a VarCharVector, so that the overhead of memory
compaction can be hidden.
What do you think?
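
A minimal sketch of this send/receive idea, under loud assumptions: the wire
format below (count, then n + 1 offsets, then the value bytes) is made up for
illustration and is NOT the Arrow IPC format, and SendScatteredSketch with all
of its members is hypothetical. The sender streams bytes straight out of the
scattered segments, so it never builds a compacted copy. The receiver reads
everything into one contiguous buffer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch; the wire format here is invented, not Arrow IPC. */
final class SendScatteredSketch {

    /** Contiguous result on the receiving side: n + 1 offsets plus one data buffer. */
    static final class Received {
        final int[] offsets;
        final byte[] data;

        Received(int[] offsets, byte[] data) {
            this.offsets = offsets;
            this.data = data;
        }

        String get(int i) {
            return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                    StandardCharsets.UTF_8);
        }
    }

    /** Sender: each entry is {segment index, start, length}. */
    static void send(DataOutputStream out, byte[][] segments, int[][] entries)
            throws IOException {
        out.writeInt(entries.length);
        int offset = 0;
        out.writeInt(offset);                  // offsets[0] is always 0
        for (int[] e : entries) {
            offset += e[2];                    // running sum of lengths
            out.writeInt(offset);
        }
        for (int[] e : entries) {
            out.write(segments[e[0]], e[1], e[2]); // no intermediate compaction
        }
    }

    /** Receiver: the bytes arrive back to back, so they land in one buffer. */
    static Received receive(DataInputStream in) throws IOException {
        int n = in.readInt();
        int[] offsets = new int[n + 1];
        for (int i = 0; i <= n; i++) {
            offsets[i] = in.readInt();
        }
        byte[] data = new byte[offsets[n]];
        in.readFully(data);
        return new Received(offsets, data);
    }

    /** In-memory round trip, standing in for a real transport. */
    static Received roundTrip(byte[][] segments, int[][] entries) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            send(new DataOutputStream(bytes), segments, entries);
            return receive(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The copy still happens, but it is folded into the write that the transport
needs anyway.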

Best,
Liya Fan

On Fri, Jul 12, 2019 at 11:07 AM Fan Liya <li...@gmail.com> wrote:

> @Uwe L. Korn
>
> Thanks a lot for the suggestion. I think this is exactly what we are doing
> right now.
>
> Best,
> Liya Fan

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Fan Liya <li...@gmail.com>.
@Uwe L. Korn

Thanks a lot for the suggestion. I think this is exactly what we are doing
right now.

Best,
Liya Fan

On Thu, Jul 11, 2019 at 9:44 PM Wes McKinney <we...@gmail.com> wrote:

> hi Liya -- have you thought about implementing this as an
> ExtensionType / ExtensionVector? You actually can already do this, so
> if this helps you reference strings stored in some external memory
> then that seems reasonable. Such a PointerStringVector could have a
> method that converts it into the Arrow varbinary columnar
> representation.
>
> You wouldn't be able to put such an object into the IPC binary
> protocol, though. If that's a requirement (being able to use the IPC
> protocol) for this kind of data, before going any further in the
> discussion I would suggest that you work out exactly how such data
> would be moved from one process address space to another (using
> Buffers).
>
> - Wes

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Wes McKinney <we...@gmail.com>.
hi Liya -- have you thought about implementing this as an
ExtensionType / ExtensionVector? You actually can already do this, so
if this helps you reference strings stored in some external memory
then that seems reasonable. Such a PointerStringVector could have a
method that converts it into the Arrow varbinary columnar
representation.
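
As a rough sketch of what such a vector and its conversion method might look
like, here is a plain-Java model. It is not built on Arrow's ExtensionType
API: PointerStringVectorSketch, append, segmentCount and toContiguous are all
hypothetical names, and fixed-size byte arrays stand in for memory segments
obtained from an allocator.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a pointer-based string vector; not an Arrow class. */
final class PointerStringVectorSketch {

    /** Contiguous result: n + 1 offsets plus one data buffer. */
    static final class Contiguous {
        final int[] offsets;
        final byte[] data;

        Contiguous(int[] offsets, byte[] data) {
            this.offsets = offsets;
            this.data = data;
        }
    }

    private final int segmentSize;
    private final List<byte[]> segments = new ArrayList<>();
    private final List<int[]> entries = new ArrayList<>(); // {segment, start, length}
    private int used; // bytes already used in the newest segment

    PointerStringVectorSketch(int segmentSize) {
        this.segmentSize = segmentSize;
    }

    /** Appends a value, requesting a new segment when the current one is full. */
    void append(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > segmentSize) {
            throw new IllegalArgumentException("value larger than a segment");
        }
        if (segments.isEmpty() || used + bytes.length > segmentSize) {
            segments.add(new byte[segmentSize]);
            used = 0;
        }
        int segment = segments.size() - 1;
        System.arraycopy(bytes, 0, segments.get(segment), used, bytes.length);
        entries.add(new int[] {segment, used, bytes.length});
        used += bytes.length;
    }

    int segmentCount() {
        return segments.size();
    }

    /** Copies every value into one buffer and derives offsets from the lengths. */
    Contiguous toContiguous() {
        int n = entries.size();
        int[] offsets = new int[n + 1];
        for (int i = 0; i < n; i++) {
            offsets[i + 1] = offsets[i] + entries.get(i)[2];
        }
        byte[] data = new byte[offsets[n]];
        for (int i = 0; i < n; i++) {
            int[] e = entries.get(i);
            System.arraycopy(segments.get(e[0]), e[1], data, offsets[i], e[2]);
        }
        return new Contiguous(offsets, data);
    }
}
```

The conversion is a single pass over the values, so its cost is one copy of
the string bytes plus the offset sums.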

You wouldn't be able to put such an object into the IPC binary
protocol, though. If that's a requirement (being able to use the IPC
protocol) for this kind of data, before going any further in the
discussion I would suggest that you work out exactly how such data
would be moved from one process address space to another (using
Buffers).

- Wes

On Thu, Jul 11, 2019 at 7:35 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Liya Fan,
>
> Here your best approach is to copy into the Arrow format, as you can then use this as the basis for working with the Arrow-native representation as well as your internal representation. You will have to use two different offset vectors, as those two will always differ. In your internal representation you don't have Arrow's requirement of consecutive data, but you can still work with the strings just as before even when they are stored consecutively.
>
> Uwe

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Liya Fan,

Here your best approach is to copy into the Arrow format, as you can then use this copy as the basis for working with both the Arrow-native representation and your internal representation. You will have to use two different offset vectors, as those two will always differ. Your internal representation does not have Arrow's requirement of consecutive data, but you can still work with the strings just as before even when they are stored consecutively.
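
A minimal sketch of this copy step, in plain Python rather than the Arrow API. It assumes a hypothetical internal layout in which each string is a (segment index, start offset, length) triple over separate buffers; the names `segments`, `entries`, `to_arrow_layout` and `get` are made up for illustration, and the validity bitmap is left out:

```python
from typing import List, Tuple

# Hypothetical internal layout: each string is (segment index, start offset, length),
# and the segments are separate, non-consecutive byte buffers.
Entry = Tuple[int, int, int]


def to_arrow_layout(segments: List[bytes], entries: List[Entry]) -> Tuple[List[int], bytes]:
    """Copy scattered strings into one consecutive data buffer plus n + 1 offsets."""
    offsets = [0]
    data = bytearray()
    for seg, start, length in entries:
        data += segments[seg][start:start + length]
        offsets.append(len(data))  # end of value i is the start of value i + 1
    return offsets, bytes(data)


def get(offsets: List[int], data: bytes, i: int) -> bytes:
    """Read value i from the consecutive layout."""
    return data[offsets[i]:offsets[i + 1]]


segments = [b"helloarrow", b"xxvarcharyy"]
entries = [(0, 0, 5), (1, 2, 7), (0, 5, 5)]
offsets, data = to_arrow_layout(segments, entries)
```

After the copy, an internal (start offset, length) vector can point into the same consecutive `data` buffer, so only the offset vectors differ between the two representations.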

Uwe

On Thu, Jul 11, 2019, at 2:24 PM, Fan Liya wrote:
> Hi Korn,
> 
> Thanks a lot for your comments.
> 
> In my opinion, your comments make sense to me. Allowing non-consecutive
> memory segments will break some good design choices of Arrow.
> However, there are wide-spread user requirements for non-consecutive memory
> segments. I am wondering how can we help such users. What advice we can
> give to them?
> 
> Memory copy/move can be a solution, but is there a better solution?
> Is there a third alternative? Can we virtualize the non-consecutive memory
> segments into a consecutive one? (Although performance overhead is
> unavoidable.)
> 
> What do you think? Let's brain-storm it.
> 
> Best,
> Liya Fan
> 
> 
> On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> 
> > Hello Liya,
> >
> > I'm quite -1 on this type as Arrow is about efficient columnar structures.
> > We have opened the standard also to matrix-like types but always keep the
> > constraint of consecutive memory. Now also adding types where memory is no
> > longer consecutive but spread in the heap will make the scope of the
> > project much wider (It seems that we then just turn into a general
> > serialization framework).
> >
> > One of the ideas of a common standard is that some need to make
> > compromises. I think in this case it is a necessary compromise to not allow
> > all kind of string representations.
> >
> > Uwe
> >
> > On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > > Hi all,
> > >
> > >
> > > We are thinking of providing varchar/varbinary vectors with a different
> > > > memory layout which exists in a wide range of systems. The memory layout is
> > > > different from that of VarCharVector in the following ways:
> > >
> > >
> > >    1.
> > >
> > >    Instead of storing (start offset, end offset), the new layout stores
> > >    (start offset, length)
> > >    2.
> > >
> > >    The content of varchars may not be in a consecutive memory region.
> > >    Instead, it can be in arbitrary memory address.
> > >
> > >
> > > Due to these differences in memory layout, it incurs performance overhead
> > > when converting data between existing systems and VarCharVectors.
> > >
> > > > The above difference 1 seems insignificant, while difference 2 is difficult
> > > > to overcome. However, the scenario of difference 2 is prevalent in
> > > practice: for example we store strings in a series of memory segments.
> > > Whenever a segment is full, we request a new one. However, these memory
> > > segments may not be consecutive, because other processes/threads are also
> > > requesting/releasing memory segments in the meantime.
> > >
> > > So we are wondering if it is possible to support such memory layout in
> > > Arrow. I think there are more systems that are trying to adopting Arrow,
> > > but are hindered by such difficulty.
> > >
> > > Would you please give your valuable feedback?
> > >
> > >
> > > Best,
> > >
> > > Liya Fan
> > >
> >
>

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by Fan Liya <li...@gmail.com>.
Hi Korn,

Thanks a lot for your comments.

Your comments make sense to me. Allowing non-consecutive
memory segments will break some good design choices of Arrow.
However, there are widespread user requirements for non-consecutive memory
segments. I am wondering how we can help such users. What advice can we
give them?

Memory copy/move can be a solution, but is there a better solution?
Is there a third alternative? Can we virtualize the non-consecutive memory
segments into a consecutive one? (Although performance overhead is
unavoidable.)
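
To make the virtualization idea concrete, here is a rough sketch in plain Python (not an Arrow API). The class `SegmentedView` and its `read` method are hypothetical names; the sketch assumes the segments are read-only byte buffers addressed through one logical offset space:

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


class SegmentedView:
    """Read-only view that addresses several separate buffers as one logical byte range."""

    def __init__(self, segments: List[bytes]):
        self.segments = segments
        # starts[k] is the logical offset at which segment k begins.
        self.starts = [0] + list(accumulate(len(s) for s in segments))

    def __len__(self) -> int:
        return self.starts[-1]

    def read(self, start: int, length: int) -> bytes:
        """Return `length` bytes beginning at logical offset `start`."""
        if start < 0 or length < 0 or start + length > len(self):
            raise IndexError("range outside the logical buffer")
        out = bytearray()
        k = bisect_right(self.starts, start) - 1  # segment containing `start`
        pos = start - self.starts[k]
        while len(out) < length:
            # Take what this segment can supply, then continue in the next one.
            out += self.segments[k][pos:pos + length - len(out)]
            k, pos = k + 1, 0
        return bytes(out)


view = SegmentedView([b"hel", b"lo wo", b"rld"])
```

The overhead shows up in two places: every read needs a binary search to find its segment, and a value that spans segments still has to be copied to be returned as one piece.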

What do you think? Let's brainstorm it.

Best,
Liya Fan


On Thu, Jul 11, 2019 at 8:05 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello Liya,
>
> I'm quite -1 on this type as Arrow is about efficient columnar structures.
> We have opened the standard also to matrix-like types but always keep the
> constraint of consecutive memory. Now also adding types where memory is no
> longer consecutive but spread in the heap will make the scope of the
> project much wider (It seems that we then just turn into a general
> serialization framework).
>
> One of the ideas of a common standard is that some need to make
> compromises. I think in this case it is a necessary compromise to not allow
> all kind of string representations.
>
> Uwe
>
> On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> > Hi all,
> >
> >
> > We are thinking of providing varchar/varbinary vectors with a different
> > > memory layout which exists in a wide range of systems. The memory layout is
> > > different from that of VarCharVector in the following ways:
> >
> >
> >    1.
> >
> >    Instead of storing (start offset, end offset), the new layout stores
> >    (start offset, length)
> >    2.
> >
> >    The content of varchars may not be in a consecutive memory region.
> >    Instead, it can be in arbitrary memory address.
> >
> >
> > Due to these differences in memory layout, it incurs performance overhead
> > when converting data between existing systems and VarCharVectors.
> >
> > > The above difference 1 seems insignificant, while difference 2 is difficult
> > > to overcome. However, the scenario of difference 2 is prevalent in
> > practice: for example we store strings in a series of memory segments.
> > Whenever a segment is full, we request a new one. However, these memory
> > segments may not be consecutive, because other processes/threads are also
> > requesting/releasing memory segments in the meantime.
> >
> > So we are wondering if it is possible to support such memory layout in
> > Arrow. I think there are more systems that are trying to adopting Arrow,
> > but are hindered by such difficulty.
> >
> > Would you please give your valuable feedback?
> >
> >
> > Best,
> >
> > Liya Fan
> >
>

Re: [Discuss] Support an alternative memory layout for varchar/varbinary vectors

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Liya,

I'm quite -1 on this type, as Arrow is about efficient columnar structures. We have also opened the standard to matrix-like types, but we always keep the constraint of consecutive memory. Adding types where memory is no longer consecutive but spread across the heap would make the scope of the project much wider (it seems that we would then just turn into a general serialization framework).

One of the ideas of a common standard is that some parties need to make compromises. I think in this case it is a necessary compromise not to allow all kinds of string representations.

Uwe

On Thu, Jul 11, 2019, at 6:01 AM, Fan Liya wrote:
> Hi all,
> 
> 
> We are thinking of providing varchar/varbinary vectors with a different
> memory layout which exists in a wide range of systems. The memory layout is
> different from that of VarCharVector in the following ways:
> 
> 
>    1.
> 
>    Instead of storing (start offset, end offset), the new layout stores
>    (start offset, length)
>    2.
> 
>    The content of varchars may not be in a consecutive memory region.
>    Instead, it can be in arbitrary memory address.
> 
> 
> Due to these differences in memory layout, it incurs performance overhead
> when converting data between existing systems and VarCharVectors.
> 
> The above difference 1 seems insignificant, while difference 2 is difficult
> to overcome. However, the scenario of difference 2 is prevalent in
> practice: for example we store strings in a series of memory segments.
> Whenever a segment is full, we request a new one. However, these memory
> segments may not be consecutive, because other processes/threads are also
> requesting/releasing memory segments in the meantime.
> 
> So we are wondering if it is possible to support such memory layout in
> Arrow. I think there are more systems that are trying to adopting Arrow,
> but are hindered by such difficulty.
> 
> Would you please give your valuable feedback?
> 
> 
> Best,
> 
> Liya Fan
>