You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Ji Liu <ti...@apache.org> on 2020/08/04 07:55:34 UTC

[DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Hi all,


When I worked on ARROW-7539[1], I met some problems and not sure what's the
proper way to solve it.


This issue was about to avoid set reader/writer indices in
FieldVector#getFieldBuffers according to the following reasons:

i. getBuffers set reader/writer indices and it's right for the purpose of
sending the data over the wire

ii. getFieldBuffers is distinct from getBuffers, it should be for getting
access to underlying data for higher-performance algorithms


Currently in VectorUnloader, we used getFieldBuffers to create
ArrowRecordBatch that's why we keep writer/reader indices in getFieldBuffers
, we should use getBuffers instead. But during the change, we found another
problem:

The order of validity and offset buffers are not in the same order in
ListVector(getBuffers's order is right), changing the API in VectorUnloader
creates problems with serialization/deserialization resulting in test
failures in Dremio which would break backward compatibility with existing
serialised files.


Micah gives a solution but seems doesn't reach consistent in the PR thread[2
]:

   1. Remove setReaderWriterIndeces in getFieldBuffers
   2. Deprecate getBuffers
   3. Introduce a new getIpcBuffers which is unambiguously used for writing
   record batches (i.e. in VectorUnloader).
   4. Update documentation where it makes sense based on all this
   conversation.


More details and discussions can be seen from the PR and hope to get more
feedback.



Thanks,

Ji Liu



[1] https://issues.apache.org/jira/browse/ARROW-7539

[2] https://github.com/apache/arrow/pull/6156

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Jacques Nadeau <ja...@apache.org>.
Per my comments there, the introduction of field buffers was added as part
of the fieldvector addition when we have vectors that weren't field level.
This meant that getbuffers and getfieldbuffers were at different levels at
hierarchy (getbuffers being more general). I believe we no longer have the
case of a vector that is not a field. As such, collapsing fieldvector and
valuevector makes a lot of sense.

In the thread I tried to say that we should capture the use cases and then
consolidate to support those use cases. I think there are definitely two
use cases (with sub in one of them) that I'm aware of:

- get me buffers for writing to a io (socket or disk)
  - transfer the buffers
  - keep the buffers in vector (double reference counts)
- get me the buffers so I can do some kind of low level memory operations
(cheap as possible please)

I actually think we've also failed to complete the other work that we
talked about wrt removing readerIndex and writerIndex from ArrowBuf. They
aren't well maintained by any of the vectors and thus are mostly broken
(and create a bunch of this confusion).

So I'd actually be inclined to fix the readerIndex/writerIndex issue before
we mess with the buffer methods. Removing that whole concept will collapse
the use cases, I believe. Then we only have transfer versus not (which is
what the getBuffers() method already provides).


On Mon, Aug 10, 2020 at 1:44 AM Ji Liu <ti...@apache.org> wrote:

> Hi Micah, I am afraid it's not a reasonable solution.
> 1. The status is that getFieldBuffers has right order buffer and was used
> in IPC, getBuffers was not used in IPC.
> 2. The purpose of this PR is to use getBuffers in IPC instead, and making
> changes in getFieldBuffers dose not seem to help this problem since it will
> break IPC format by using getBuffers.
>
> Micah Kornfield <em...@gmail.com> 于2020年8月8日周六 上午11:50写道:
>
> >  Thinking about this some more, I think maybe we should also potentially
> > try to deprecate hold off on any changes to getFieldBuffers.  It should
> > likely follow the same sort of pattern for getBuffers (create a new
> method
> > that doesn't set reader/writer indices and deprecate getFieldBuffers).
> But
> > I think that can be handled in a separate PR?
> >
> > Anybody else have thoughts?
> >
> > -Micah
> >
> > On Tue, Aug 4, 2020 at 11:24 PM Ji Liu <ti...@apache.org> wrote:
> >
> > > hi liya,
> > > Thanks for your careful review, it is a typo, the order of getBuffers
> is
> > > wrong.
> > >
> > > Fan Liya <li...@gmail.com> 于2020年8月5日周三 下午2:14写道:
> > >
> > > > Hi Ji,
> > > >
> > > > IMO, for the correct order, the validity buffer should precede the
> > offset
> > > > buffer (e.g. this is the order used by BaseVariableWidthVector &
> > > > BaseLargeVariableWidthVector).
> > > > In ListVector#getBuffers, the offset buffer precedes the validity
> > buffer,
> > > > so I am a little confused why you say the order of
> > ListVector#getBuffers
> > > is
> > > > right?
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <
> emkornfield@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > FWIW, I lack historical context on how these methods evolved, so
> I'd
> > > > > appreciate insight from anyone who has worked on the java codebase
> > for
> > > a
> > > > > longer period of time.  The current situation seems less then
> ideal.
> > > > >
> > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org>
> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > >
> > > > > > When I worked on ARROW-7539[1], I met some problems and not sure
> > > what's
> > > > > the
> > > > > > proper way to solve it.
> > > > > >
> > > > > >
> > > > > > This issue was about to avoid set reader/writer indices in
> > > > > > FieldVector#getFieldBuffers according to the following reasons:
> > > > > >
> > > > > > i. getBuffers set reader/writer indices and it's right for the
> > > purpose
> > > > of
> > > > > > sending the data over the wire
> > > > > >
> > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for
> > > > getting
> > > > > > access to underlying data for higher-performance algorithms
> > > > > >
> > > > > >
> > > > > > Currently in VectorUnloader, we used getFieldBuffers to create
> > > > > > ArrowRecordBatch that's why we keep writer/reader indices in
> > > > > > getFieldBuffers
> > > > > > , we should use getBuffers instead. But during the change, we
> found
> > > > > another
> > > > > > problem:
> > > > > >
> > > > > > The order of validity and offset buffers are not in the same
> order
> > in
> > > > > > ListVector(getBuffers's order is right), changing the API in
> > > > > VectorUnloader
> > > > > > creates problems with serialization/deserialization resulting in
> > test
> > > > > > failures in Dremio which would break backward compatibility with
> > > > existing
> > > > > > serialised files.
> > > > > >
> > > > > >
> > > > > > Micah gives a solution but seems doesn't reach consistent in the
> PR
> > > > > > thread[2
> > > > > > ]:
> > > > > >
> > > > > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > > > > >    2. Deprecate getBuffers
> > > > > >    3. Introduce a new getIpcBuffers which is unambiguously used
> for
> > > > > writing
> > > > > >    record batches (i.e. in VectorUnloader).
> > > > > >    4. Update documentation where it makes sense based on all this
> > > > > >    conversation.
> > > > > >
> > > > > >
> > > > > > More details and discussions can be seen from the PR and hope to
> > get
> > > > more
> > > > > > feedback.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Ji Liu
> > > > > >
> > > > > >
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > > > > >
> > > > > > [2] https://github.com/apache/arrow/pull/6156
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Ji Liu <ti...@apache.org>.
Hi Micah, I am afraid it's not a reasonable solution.
1. The status is that getFieldBuffers has right order buffer and was used
in IPC, getBuffers was not used in IPC.
2. The purpose of this PR is to use getBuffers in IPC instead, and making
changes in getFieldBuffers dose not seem to help this problem since it will
break IPC format by using getBuffers.

Micah Kornfield <em...@gmail.com> 于2020年8月8日周六 上午11:50写道:

>  Thinking about this some more, I think maybe we should also potentially
> try to deprecate hold off on any changes to getFieldBuffers.  It should
> likely follow the same sort of pattern for getBuffers (create a new method
> that doesn't set reader/writer indices and deprecate getFieldBuffers).  But
> I think that can be handled in a separate PR?
>
> Anybody else have thoughts?
>
> -Micah
>
> On Tue, Aug 4, 2020 at 11:24 PM Ji Liu <ti...@apache.org> wrote:
>
> > hi liya,
> > Thanks for your careful review, it is a typo, the order of getBuffers is
> > wrong.
> >
> > Fan Liya <li...@gmail.com> 于2020年8月5日周三 下午2:14写道:
> >
> > > Hi Ji,
> > >
> > > IMO, for the correct order, the validity buffer should precede the
> offset
> > > buffer (e.g. this is the order used by BaseVariableWidthVector &
> > > BaseLargeVariableWidthVector).
> > > In ListVector#getBuffers, the offset buffer precedes the validity
> buffer,
> > > so I am a little confused why you say the order of
> ListVector#getBuffers
> > is
> > > right?
> > >
> > > Best,
> > > Liya Fan
> > >
> > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > >
> > > > FWIW, I lack historical context on how these methods evolved, so I'd
> > > > appreciate insight from anyone who has worked on the java codebase
> for
> > a
> > > > longer period of time.  The current situation seems less then ideal.
> > > >
> > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > >
> > > > > When I worked on ARROW-7539[1], I met some problems and not sure
> > what's
> > > > the
> > > > > proper way to solve it.
> > > > >
> > > > >
> > > > > This issue was about to avoid set reader/writer indices in
> > > > > FieldVector#getFieldBuffers according to the following reasons:
> > > > >
> > > > > i. getBuffers set reader/writer indices and it's right for the
> > purpose
> > > of
> > > > > sending the data over the wire
> > > > >
> > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for
> > > getting
> > > > > access to underlying data for higher-performance algorithms
> > > > >
> > > > >
> > > > > Currently in VectorUnloader, we used getFieldBuffers to create
> > > > > ArrowRecordBatch that's why we keep writer/reader indices in
> > > > > getFieldBuffers
> > > > > , we should use getBuffers instead. But during the change, we found
> > > > another
> > > > > problem:
> > > > >
> > > > > The order of validity and offset buffers are not in the same order
> in
> > > > > ListVector(getBuffers's order is right), changing the API in
> > > > VectorUnloader
> > > > > creates problems with serialization/deserialization resulting in
> test
> > > > > failures in Dremio which would break backward compatibility with
> > > existing
> > > > > serialised files.
> > > > >
> > > > >
> > > > > Micah gives a solution but seems doesn't reach consistent in the PR
> > > > > thread[2
> > > > > ]:
> > > > >
> > > > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > > > >    2. Deprecate getBuffers
> > > > >    3. Introduce a new getIpcBuffers which is unambiguously used for
> > > > writing
> > > > >    record batches (i.e. in VectorUnloader).
> > > > >    4. Update documentation where it makes sense based on all this
> > > > >    conversation.
> > > > >
> > > > >
> > > > > More details and discussions can be seen from the PR and hope to
> get
> > > more
> > > > > feedback.
> > > > >
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Ji Liu
> > > > >
> > > > >
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > > > >
> > > > > [2] https://github.com/apache/arrow/pull/6156
> > > > >
> > > >
> > >
> >
>

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Micah Kornfield <em...@gmail.com>.
 Thinking about this some more, I think maybe we should also potentially
try to deprecate hold off on any changes to getFieldBuffers.  It should
likely follow the same sort of pattern for getBuffers (create a new method
that doesn't set reader/writer indices and deprecate getFieldBuffers).  But
I think that can be handled in a separate PR?

Anybody else have thoughts?

-Micah

On Tue, Aug 4, 2020 at 11:24 PM Ji Liu <ti...@apache.org> wrote:

> hi liya,
> Thanks for your careful review, it is a typo, the order of getBuffers is
> wrong.
>
> Fan Liya <li...@gmail.com> 于2020年8月5日周三 下午2:14写道:
>
> > Hi Ji,
> >
> > IMO, for the correct order, the validity buffer should precede the offset
> > buffer (e.g. this is the order used by BaseVariableWidthVector &
> > BaseLargeVariableWidthVector).
> > In ListVector#getBuffers, the offset buffer precedes the validity buffer,
> > so I am a little confused why you say the order of ListVector#getBuffers
> is
> > right?
> >
> > Best,
> > Liya Fan
> >
> > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > FWIW, I lack historical context on how these methods evolved, so I'd
> > > appreciate insight from anyone who has worked on the java codebase for
> a
> > > longer period of time.  The current situation seems less then ideal.
> > >
> > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org> wrote:
> > >
> > > > Hi all,
> > > >
> > > >
> > > > When I worked on ARROW-7539[1], I met some problems and not sure
> what's
> > > the
> > > > proper way to solve it.
> > > >
> > > >
> > > > This issue was about to avoid set reader/writer indices in
> > > > FieldVector#getFieldBuffers according to the following reasons:
> > > >
> > > > i. getBuffers set reader/writer indices and it's right for the
> purpose
> > of
> > > > sending the data over the wire
> > > >
> > > > ii. getFieldBuffers is distinct from getBuffers, it should be for
> > getting
> > > > access to underlying data for higher-performance algorithms
> > > >
> > > >
> > > > Currently in VectorUnloader, we used getFieldBuffers to create
> > > > ArrowRecordBatch that's why we keep writer/reader indices in
> > > > getFieldBuffers
> > > > , we should use getBuffers instead. But during the change, we found
> > > another
> > > > problem:
> > > >
> > > > The order of validity and offset buffers are not in the same order in
> > > > ListVector(getBuffers's order is right), changing the API in
> > > VectorUnloader
> > > > creates problems with serialization/deserialization resulting in test
> > > > failures in Dremio which would break backward compatibility with
> > existing
> > > > serialised files.
> > > >
> > > >
> > > > Micah gives a solution but seems doesn't reach consistent in the PR
> > > > thread[2
> > > > ]:
> > > >
> > > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > > >    2. Deprecate getBuffers
> > > >    3. Introduce a new getIpcBuffers which is unambiguously used for
> > > writing
> > > >    record batches (i.e. in VectorUnloader).
> > > >    4. Update documentation where it makes sense based on all this
> > > >    conversation.
> > > >
> > > >
> > > > More details and discussions can be seen from the PR and hope to get
> > more
> > > > feedback.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Ji Liu
> > > >
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > > >
> > > > [2] https://github.com/apache/arrow/pull/6156
> > > >
> > >
> >
>

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Ji Liu <ti...@apache.org>.
hi liya,
Thanks for your careful review, it is a typo, the order of getBuffers is
wrong.

Fan Liya <li...@gmail.com> 于2020年8月5日周三 下午2:14写道:

> Hi Ji,
>
> IMO, for the correct order, the validity buffer should precede the offset
> buffer (e.g. this is the order used by BaseVariableWidthVector &
> BaseLargeVariableWidthVector).
> In ListVector#getBuffers, the offset buffer precedes the validity buffer,
> so I am a little confused why you say the order of ListVector#getBuffers is
> right?
>
> Best,
> Liya Fan
>
> On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > FWIW, I lack historical context on how these methods evolved, so I'd
> > appreciate insight from anyone who has worked on the java codebase for a
> > longer period of time.  The current situation seems less then ideal.
> >
> > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > >
> > > When I worked on ARROW-7539[1], I met some problems and not sure what's
> > the
> > > proper way to solve it.
> > >
> > >
> > > This issue was about to avoid set reader/writer indices in
> > > FieldVector#getFieldBuffers according to the following reasons:
> > >
> > > i. getBuffers set reader/writer indices and it's right for the purpose
> of
> > > sending the data over the wire
> > >
> > > ii. getFieldBuffers is distinct from getBuffers, it should be for
> getting
> > > access to underlying data for higher-performance algorithms
> > >
> > >
> > > Currently in VectorUnloader, we used getFieldBuffers to create
> > > ArrowRecordBatch that's why we keep writer/reader indices in
> > > getFieldBuffers
> > > , we should use getBuffers instead. But during the change, we found
> > another
> > > problem:
> > >
> > > The order of validity and offset buffers are not in the same order in
> > > ListVector(getBuffers's order is right), changing the API in
> > VectorUnloader
> > > creates problems with serialization/deserialization resulting in test
> > > failures in Dremio which would break backward compatibility with
> existing
> > > serialised files.
> > >
> > >
> > > Micah gives a solution but seems doesn't reach consistent in the PR
> > > thread[2
> > > ]:
> > >
> > >    1. Remove setReaderWriterIndeces in getFieldBuffers
> > >    2. Deprecate getBuffers
> > >    3. Introduce a new getIpcBuffers which is unambiguously used for
> > writing
> > >    record batches (i.e. in VectorUnloader).
> > >    4. Update documentation where it makes sense based on all this
> > >    conversation.
> > >
> > >
> > > More details and discussions can be seen from the PR and hope to get
> more
> > > feedback.
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Ji Liu
> > >
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-7539
> > >
> > > [2] https://github.com/apache/arrow/pull/6156
> > >
> >
>

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Fan Liya <li...@gmail.com>.
Hi Ji,

IMO, for the correct order, the validity buffer should precede the offset
buffer (e.g. this is the order used by BaseVariableWidthVector &
BaseLargeVariableWidthVector).
In ListVector#getBuffers, the offset buffer precedes the validity buffer,
so I am a little confused why you say the order of ListVector#getBuffers is
right?

Best,
Liya Fan

On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield <em...@gmail.com>
wrote:

> FWIW, I lack historical context on how these methods evolved, so I'd
> appreciate insight from anyone who has worked on the java codebase for a
> longer period of time.  The current situation seems less then ideal.
>
> On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org> wrote:
>
> > Hi all,
> >
> >
> > When I worked on ARROW-7539[1], I met some problems and not sure what's
> the
> > proper way to solve it.
> >
> >
> > This issue was about to avoid set reader/writer indices in
> > FieldVector#getFieldBuffers according to the following reasons:
> >
> > i. getBuffers set reader/writer indices and it's right for the purpose of
> > sending the data over the wire
> >
> > ii. getFieldBuffers is distinct from getBuffers, it should be for getting
> > access to underlying data for higher-performance algorithms
> >
> >
> > Currently in VectorUnloader, we used getFieldBuffers to create
> > ArrowRecordBatch that's why we keep writer/reader indices in
> > getFieldBuffers
> > , we should use getBuffers instead. But during the change, we found
> another
> > problem:
> >
> > The order of validity and offset buffers are not in the same order in
> > ListVector(getBuffers's order is right), changing the API in
> VectorUnloader
> > creates problems with serialization/deserialization resulting in test
> > failures in Dremio which would break backward compatibility with existing
> > serialised files.
> >
> >
> > Micah gives a solution but seems doesn't reach consistent in the PR
> > thread[2
> > ]:
> >
> >    1. Remove setReaderWriterIndeces in getFieldBuffers
> >    2. Deprecate getBuffers
> >    3. Introduce a new getIpcBuffers which is unambiguously used for
> writing
> >    record batches (i.e. in VectorUnloader).
> >    4. Update documentation where it makes sense based on all this
> >    conversation.
> >
> >
> > More details and discussions can be seen from the PR and hope to get more
> > feedback.
> >
> >
> >
> > Thanks,
> >
> > Ji Liu
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-7539
> >
> > [2] https://github.com/apache/arrow/pull/6156
> >
>

Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers

Posted by Micah Kornfield <em...@gmail.com>.
FWIW, I lack historical context on how these methods evolved, so I'd
appreciate insight from anyone who has worked on the java codebase for a
longer period of time.  The current situation seems less then ideal.

On Tue, Aug 4, 2020 at 12:55 AM Ji Liu <ti...@apache.org> wrote:

> Hi all,
>
>
> When I worked on ARROW-7539[1], I met some problems and not sure what's the
> proper way to solve it.
>
>
> This issue was about to avoid set reader/writer indices in
> FieldVector#getFieldBuffers according to the following reasons:
>
> i. getBuffers set reader/writer indices and it's right for the purpose of
> sending the data over the wire
>
> ii. getFieldBuffers is distinct from getBuffers, it should be for getting
> access to underlying data for higher-performance algorithms
>
>
> Currently in VectorUnloader, we used getFieldBuffers to create
> ArrowRecordBatch that's why we keep writer/reader indices in
> getFieldBuffers
> , we should use getBuffers instead. But during the change, we found another
> problem:
>
> The order of validity and offset buffers are not in the same order in
> ListVector(getBuffers's order is right), changing the API in VectorUnloader
> creates problems with serialization/deserialization resulting in test
> failures in Dremio which would break backward compatibility with existing
> serialised files.
>
>
> Micah gives a solution but seems doesn't reach consistent in the PR
> thread[2
> ]:
>
>    1. Remove setReaderWriterIndeces in getFieldBuffers
>    2. Deprecate getBuffers
>    3. Introduce a new getIpcBuffers which is unambiguously used for writing
>    record batches (i.e. in VectorUnloader).
>    4. Update documentation where it makes sense based on all this
>    conversation.
>
>
> More details and discussions can be seen from the PR and hope to get more
> feedback.
>
>
>
> Thanks,
>
> Ji Liu
>
>
>
> [1] https://issues.apache.org/jira/browse/ARROW-7539
>
> [2] https://github.com/apache/arrow/pull/6156
>