Posted to dev@arrow.apache.org by Paul Rogers <pa...@yahoo.com.INVALID> on 2018/09/03 23:06:41 UTC

Re: Contribute "RowSet" mechanism from Apache Drill?

Filed a JIRA ticket: ARROW-3164.

The original e-mail linked to a wiki that explains the Row Set abstraction in the Drill context. The ticket points to a new GitHub wiki that discusses the abstraction in the Arrow context, including examples. The wiki also explains the motivation: the challenges Drill faced when reading row-oriented data into vectors and reading it back out, and how those challenges may apply in the Arrow context.

Looks like recent Arrow work has greatly improved the interoperability features of the project. Still, at some point, code must write data into vectors and read data out. Often the interface is row-oriented. If that code is in Java, the Row Set abstractions can help.

A new top-level Java module is a great idea. There may be some dependency issues in leveraging material from the "vector" module; we'll resolve those as we hit them.
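
For the write side, loading rows might look like the sketch below. It is based on Drill's RowSetWriter; I'm assuming that interface ports to Arrow largely unchanged, and the row-source calls (hasMoreRows, nextValue) are placeholders:

RowSetWriter writer = // create writer for the target schema
ScalarWriter vcWriter = writer.scalar("colName"); // your VARCHAR column
while (hasMoreRows()) {            // placeholder row source
  vcWriter.setString(nextValue()); // placeholder value source
  writer.save();                   // commit the current row
}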

Here is a quick example on the read side. Jacques recently posted a code example to retrieve data from a list vector of VARCHAR values:

int recordIndexToRead = ...
ListVector lv = ...
// The offset buffer holds 4-byte int offsets into the data vector.
ArrowBuf offsetVector = lv.getOffsetBuffer();
// getDataVector() returns a FieldVector, so a cast is needed.
VarCharVector vc = (VarCharVector) lv.getDataVector();
int listStart = offsetVector.getInt(recordIndexToRead * 4);
int listEnd = offsetVector.getInt((recordIndexToRead + 1) * 4);
NullableVarCharHolder nvh = new NullableVarCharHolder();
for (int i = listStart; i < listEnd; i++) {
  vc.get(i, nvh);
  // do something with the data
}

Here is how to iterate over a record batch, accessing a single VARCHAR column, using the Row Set abstractions. The e-mail mentioned a byte array, so let's use that here:

RowSet rowSet = // create row set from record batch
RowSetReader reader = rowSet.reader();
ScalarReader vcReader = reader.scalar("colName"); // Get your VARCHAR column
while (reader.next()) {
  byte[] data = vcReader.getBytes();
  // Do something with the data
}

Data can also be retrieved as a Java String, if that is more convenient in this use case:

  String data = vcReader.getString();

In either case, if the value is a SQL NULL, the above methods (because they return Java objects) will return a Java null. (For primitive types, you can call the ScalarReader.isNull() method.)
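
For example, with a nullable INT column, a null check might look like this (a sketch; it assumes Drill's ScalarReader.getInt() carries over as-is, and "intCol" is a hypothetical column name):

ScalarReader intReader = reader.scalar("intCol");
while (reader.next()) {
  if (intReader.isNull()) {
    // handle SQL NULL
  } else {
    int value = intReader.getInt(); // safe: the value is non-null
    // use the value
  }
}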

Thanks,
- Paul

 

    On Thursday, August 30, 2018, 7:44:51 PM PDT, Jacques Nadeau <ja...@apache.org> wrote:  
 
 New Jira sounds good.

Many times algorithms interact directly with vectors, but there are also many
times this is not the case. Would be great to see more detail about an
example use. Maybe propose it as a new module so people can use it if they
want but don't have to consume it unless they need to?

On Mon, Aug 27, 2018 at 6:28 PM Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Hi Jacques,
>
> Thanks much for the note. I wonder, when reading data into, or out of,
> Arrow, are not the interfaces often row-wise? For example, it is somewhat
> difficult to read a CSV file column-wise. Similarly, when serving a BI tool
> (for tables or charts), data must be presented row-wise. (JDBC, for
> example, is a row-wise interface.) The abstractions help with these cases.
>
> Perhaps much of the emphasis in Arrow is in cross-tool compatibility in
> which data is passed column-wise as a set of vectors? The abstractions
> wouldn't be needed in this data transfer case.
>
> The batch size component is an essential part of row-wise loading. When
> reading data into vectors, even from Parquet, we found it necessary to 1)
> control the overall amount of memory used by the batch, and 2) read the
> same number of rows for every column. The RowSet abstractions encapsulate
> this coordinated cross-column work.
>
> The memory limits in the "RowSet" abstraction are not estimates. (There
> was a separate Drill project for that, which is why it might be confusing.)
> Instead, the memory limits are based on knowing the current write offset
> into each vector. In Drill, when a vector becomes full, we automatically
> resize the vector by doubling the memory for that vector. The RowSet
> abstraction tracks when doubling the vector would exceed the "budget" set
> for that vector or batch. When the limit is reached, the abstraction marks the
> batch complete. (The "overflow" row is saved for later to avoid exceeding
> the limit, and to keep the details of overflow hidden from the client.) The
> same logic can be applied, I would assume, to whatever memory allocation
> technique is used in Arrow, if Arrow has evolved beyond Drill's technique.
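>
> To make the mechanism concrete, the core check is roughly this (illustrative
> Java only; the helper names are hypothetical, not the actual Drill or Arrow
> API):
>
>   // Called when a row write finds a vector full.
>   long doubled = vectorCapacityBytes * 2;
>   if (doubled > vectorBudgetBytes) {
>     // Doubling would exceed the budget: end this batch and carry the
>     // in-flight ("overflow") row into the next batch.
>     markBatchComplete();
>     moveOverflowRowToNextBatch();
>   } else {
>     reallocVector(doubled);  // double the vector and continue writing
>   }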
>
> A size estimate (when available) helps by allowing the client code to
> pre-allocate vectors to their final size. Doing so avoids growing vectors
> during data loads. In this case, the abstractions simply pack data into
> those pre-allocated vectors until one of them becomes full.
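>
> In Arrow terms, pre-allocation might look like this (a sketch using the stock
> vector API; the estimate variables are assumed to come from the client):
>
>   VarCharVector v = new VarCharVector("colName", allocator);
>   // Reserve the estimated data bytes and row count up front so the
>   // vector never needs to double during the load.
>   v.allocateNew(estimatedDataBytes, estimatedRowCount);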
>
> The idea of separating memory from reading/writing is sound. In fact,
> that's how the code is structured. The memory-unaware version is heavily
> used in unit tests where we know how much memory is used. The memory-aware
> version is used in production to handle whatever strange data sets present
> themselves.
>
> Of course, none of this was clear from my terse description. I'll go ahead
> and create a JIRA ticket to provide additional context and to gather
> detailed comments so we can figure out the best way to proceed.
>
> Thanks,
>
> - Paul
>
>
>
>    On Monday, August 27, 2018, 5:52:19 PM PDT, Jacques Nadeau <jacques@apache.org> wrote:
>
>  This seems like it could be a useful addition. In general, our experience
> with writing Arrow structures is that the optimal path is columnar
> interaction rather than row-wise. That being said, most people start out by
> interacting with Arrow row-wise first, and an interface like this could help
> people start writing Arrow datasets with less effort and fewer mistakes.
>
> In terms of record batch sizing/estimations, I think that should probably
> be uncoupled from writing/reading vectors.
>
>
>
> On Mon, Aug 27, 2018 at 7:00 AM Li Jin <ic...@gmail.com> wrote:
>
> > Hi Paul,
> >
> > Thank you for the email. I think this is interesting.
> >
> > Arrow (Java API) currently doesn't have the capability of automatically
> > limiting the memory size of record batches. In Spark we have similar needs
> > to limit the size of record batches, and we have talked about implementing
> > some kind of size estimator for record batches, but haven't started to
> > work on it.
> >
> > I personally think it makes sense for Arrow to incorporate such
> > capabilities.
> >
> >
> >
> > On Mon, Aug 27, 2018 at 1:33 AM Paul Rogers <pa...@yahoo.com.invalid>
> > wrote:
> >
> > > Hi All,
> > >
> > > Over in the Apache Drill project, we developed some handy vector
> > > reader/writer abstractions. I wonder if they might be of interest to
> > > Apache Arrow. Key contributions of the "RowSet" abstractions:
> > >
> > > * Control row batch size: the aggregate memory taken by a set of vectors
> > >   (and all their sub-vectors for structured types).
> > > * Control the maximum per-vector size.
> > > * Simple, highly optimized read/write interface that handles vector
> > >   offset accounting, even for deeply nested types.
> > > * Minimize vector internal fragmentation (wasted space).
> > >
> > > More information is available in [1]. Arrow improved and simplified
> > > Drill's original vector and metadata abstractions. As a result, work
> > > would be required to port the RowSet code from Drill's version of these
> > > classes to the Arrow versions.
> > >
> > > Does Arrow already have a similar solution? If not, would the above be
> > > useful for Arrow?
> > >
> > > Thanks,
> > > - Paul
> > >
> > >
> > > Apache Drill PMC member
> > > Co-author of the upcoming O'Reilly book "Learning Apache Drill"
> > > [1]
> > > https://github.com/paul-rogers/drill/wiki/RowSet-Abstractions-for-Arrow
> > >
> > >
> > >
> >
>
  

Re: Contribute "RowSet" mechanism from Apache Drill?

Posted by Jacques Nadeau <ja...@apache.org>.
Hey Paul, it looks like ScalarReader is simply a renamed version of
FieldReader, which the Arrow Vector module already contains.
