Posted to user@arrow.apache.org by Chang She <ch...@eto.ai> on 2022/09/20 17:12:43 UTC

[compute] limit push-down and nested types

Hi there,

We're creating a new columnar data format for computer vision with Arrow
integration as a first-class citizen (github.com/eto-ai/lance). It
significantly outperforms Parquet in a variety of computer vision workloads.

*Question 1:*

Because vision data tends to be large-ish blobs, we want to be very careful
about how much data is being retrieved. So we want to be able to push down
limit/offset where appropriate, to support data exploration queries
(e.g., "show me page 2 of N images that meet *these* filtering criteria").
For now we've implemented our own Dataset/Scanner subclass to support these
extra options.

Example:

```python
lance.dataset(uri).scanner(limit=10, offset=20)
```

And the idea is that it would only retrieve those rows from disk.

However, I'm wondering if there's a better path to integrating that more
"natively" into Arrow. Happy to make contributions if that's an option.


*Question 2:*

In computer vision we're often dealing with deeply nested data types (e.g.,
object detection annotations that have a list of labels, bounding boxes,
polygons, etc. for each image). Lance supports efficient filtering scans on
these list-of-struct columns, but a) it's hard to express nested field
references (I have to use the non-public pc.Expression._nested_field and
convert list-of-struct to struct-of-list), and b) the compute operations
are not implemented for List arrays.
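
To make the shape concrete, here's a toy version of the kind of column I mean
(field names are made up just for illustration):

```python
import pyarrow as pa

# One list of annotation structs per image.
annotations = pa.array(
    [
        [{"label": "cat", "box": [0.1, 0.2, 0.5, 0.6]},
         {"label": "dog", "box": [0.3, 0.3, 0.9, 0.9]}],
        [{"label": "person", "box": [0.0, 0.0, 0.4, 0.8]}],
    ],
    type=pa.list_(pa.struct([
        ("label", pa.string()),
        ("box", pa.list_(pa.float32())),
    ])),
)
table = pa.table({"image_id": [1, 2], "annotations": annotations})

# What we'd like to express directly against the scan, roughly:
# project annotations.label and filter on it, without first converting
# the list-of-struct column to struct-of-list by hand.
```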

Any guidance/thoughts on how y'all are thinking about efficient compute on
nested data?

Thanks!

Chang

Re: [compute] limit push-down and nested types

Posted by Lei Xu <le...@apache.org>.
Happy to provide some implementation details.

On 2022/09/20 18:58:18 Weston Pace wrote:
> > However, I'm wondering if there's a better path to integrating that more "natively" into Arrow. Happy to make contributions if that's an option.
> 
> I'm not quite sure I understand where Arrow integration comes into
> play here.  Would that scanner use Arrow internally?  Or would you
> only convert the output to Arrow?

We integrate the Lance format with Arrow via Arrow's C++ Scanner / Dataset APIs. 
https://arrow.apache.org/docs/cpp/api/dataset.html#dataset

We implemented a LanceFileFormat / LanceFileFragment, inheriting from the "arrow::dataset::{FileFormat, FileFragment}" counterparts, which makes the format work with "arrow::dataset::Dataset" in C++, and thus with pyarrow's Dataset API (https://arrow.apache.org/docs/python/api/dataset.html) via Cython.

Underneath, Lance relies on "arrow::dataset::ScanOptions" to pass projections and predicates. If we could also push limit / offset via ScanOptions, it would let us skip reading heavy columns in vision data.
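
Concretely, from the Python side this is roughly what already gets pushed down
today (the path and column names are made up for illustration):

```python
import pyarrow.dataset as ds
import lance  # exposes lance.dataset(), built on the classes above

dataset = lance.dataset("/path/to/dataset.lance")  # placeholder path

# Projection and predicate arrive via ScanOptions, so the heavy "image"
# blob column is only read for rows that survive the filter.
table = dataset.scanner(
    columns=["image_id", "image"],
    filter=ds.field("label") == "cat",
).to_table()

# There is no ScanOptions slot for limit / offset today, which is the
# missing piece for "page 2 of N matching images" style queries.
```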

> If you're planning on using Arrow's scanner internally then I think
> there are improvements that could be made to add support for partially
> reading a dataset.  There are a few tentative explorations here (e.g.
> I think marsupialtail has been looking at slicing fragments and there
> is some existing Spark work with slicing fragments).
> 
> If you're only converting the output to Arrow then I don't think there
> are any special interoperability concerns.
> 
> What does `scanner` return?  Does it return a table?  Or an iterator?

We implemented the FileFormat::ScanBatchesAsync (C++) interface (https://github.com/apache/arrow/blob/40ec95646962cccdcd62032c80e8506d4c275bc6/cpp/src/arrow/dataset/file_base.h#L153-L155), so the scan yields record batches asynchronously rather than a single table.


> > it's hard to express nested field references (i have to use a non-public pc.Expression._nested_field and convert list-of-struct to struct-of-list)
> 
> There was some work recently to add better support for nested field references:
> 
> ```
> >>> import pyarrow as pa
> >>> import pyarrow.dataset as ds
> 
> >>> points = [{'x': 7, 'y': 10}, {'x': 12, 'y': 44}]
> >>> table = pa.Table.from_pydict({'points': points})
> >>> ds.write_dataset(table, '/tmp/my_dataset', format='parquet')
> 
> >>> ds.dataset('/tmp/my_dataset').to_table(columns={'x': ds.field('points', 'x')}, filter=(ds.field('points', 'y') > 20))
> pyarrow.Table
> x: int64
> ----
> x: [[12]]
> ```
> 
> > the compute operations are not implemented for List arrays.
> 
> What sorts of operations would you like to see on list arrays?  Can
> you give an example?

It would be nice if we could have expressions like `list_contains` or `list_size`/`list_length` to push along with "ScanOptions::filters", which would offer the opportunity to save some I/O when reading vision data.
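
For reference, the closest I can get with today's kernels is eager, i.e. it runs
after the data has already been read, rather than being an expression that
"ScanOptions::filters" could consume. A rough sketch with toy data:

```python
import pyarrow as pa
import pyarrow.compute as pc

annotations = pa.array(
    [
        [{"label": "cat"}, {"label": "dog"}],
        [{"label": "person"}],
    ],
    type=pa.list_(pa.struct([("label", pa.string())])),
)
table = pa.table({"image_id": [1, 2], "annotations": annotations})

# list_value_length exists as an eager kernel...
sizes = pc.list_value_length(annotations)

# ...and "list contains 'cat'" can be emulated by flattening and mapping
# matches back to their parent rows, but neither form can be handed to
# ScanOptions::filters, so the heavy columns are still read first.
labels = pc.list_flatten(annotations).field("label")
parents = pc.list_parent_indices(annotations)
rows_with_cat = pc.unique(pc.filter(parents, pc.equal(labels, "cat")))
result = table.take(rows_with_cat)
```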

Best, 

Lei

Re: [compute] limit push-down and nested types

Posted by Weston Pace <we...@gmail.com>.
> However, I'm wondering if there's a better path to integrating that more "natively" into Arrow. Happy to make contributions if that's an option.

I'm not quite sure I understand where Arrow integration comes into
play here.  Would that scanner use Arrow internally?  Or would you
only convert the output to Arrow?

If you're planning on using Arrow's scanner internally then I think
there are improvements that could be made to add support for partially
reading a dataset.  There are a few tentative explorations here (e.g.
I think marsupialtail has been looking at slicing fragments and there
is some existing Spark work with slicing fragments).
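
For reference, a rough sketch of what partial reading looks like with today's
public API (placeholder path; real fragment slicing would go further than this):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("/tmp/my_dataset", format="parquet")

# head() stops the scan once enough batches have been produced.
preview = dataset.head(10)

# Or walk fragments and stop early yourself.
rows_needed = 10
pieces = []
for fragment in dataset.get_fragments():
    piece = fragment.head(rows_needed)
    pieces.append(piece)
    rows_needed -= piece.num_rows
    if rows_needed <= 0:
        break
```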

If you're only converting the output to Arrow then I don't think there
are any special interoperability concerns.

What does `scanner` return?  Does it return a table?  Or an iterator?

> it's hard to express nested field references (i have to use a non-public pc.Expression._nested_field and convert list-of-struct to struct-of-list)

There was some work recently to add better support for nested field references:

```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds

>>> points = [{'x': 7, 'y': 10}, {'x': 12, 'y': 44}]
>>> table = pa.Table.from_pydict({'points': points})
>>> ds.write_dataset(table, '/tmp/my_dataset', format='parquet')

>>> ds.dataset('/tmp/my_dataset').to_table(columns={'x': ds.field('points', 'x')}, filter=(ds.field('points', 'y') > 20))
pyarrow.Table
x: int64
----
x: [[12]]
```

> the compute operations are not implemented for List arrays.

What sorts of operations would you like to see on list arrays?  Can
you give an example?
