You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2020/04/06 19:45:38 UTC

[kudu-CR] WIP: support for passing arrow data to python

Hello Grant Henke,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/15661

to review the following change.


Change subject: WIP: support for passing arrow data to python
......................................................................

WIP: support for passing arrow data to python

This adds some support to pass arrow-formatted data to pyarrow.

WIP because it needs some cleanup, and also need to figure out how to
test this, given it depends on pyarrow nightly builds. The required
feature should be in the next pyarrow release due in a "couple weeks" so
will probably hold off until then.

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/client.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M src/kudu/client/CMakeLists.txt
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
14 files changed, 254 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................


Patch Set 5:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx
File python/kudu/client.pyx:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@60
PS4, Line 60:     if LooseVersion(pyarrow.__version__) < LooseVersion(min_arrow_version):
> A version check might not be a bad idea since the C interface didn't appear
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2089
PS4, Line 2089:                 [pyarrow.array([]) for c in range(len(self._arrow_schema.names))],
> Recommend `with nogil:` block here
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2091
PS4, Line 2091: 
> In principle this 0-row case shouldn't be necessary?
Good catch. Seems the empty scan-batch case isn't handled properly. I'm punting on it for now though since it requires some additional plumbing. I may come around to this before merging if I can find more time to work on this.


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2092
PS4, Line 2092:         if not self.has_more_rows():
> with nogil:
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2224
PS4, Line 2224:         import pyarrow
> Can we have a `to_arrow()` method, too? Then `to_pandas()` is just `self.to
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2230
PS4, Line 2230:         return pyarrow.Table.from_batches(arrow_batches, schema=self._arrow_schema)
> This formatting is a bit odd, generally we might see something like
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/schema.pyx
File python/kudu/schema.pyx:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/schema.pyx@785
PS4, Line 785:         check_status(self.schema.ToArrowSchema(&arrow_schema))
> with nogil:
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/test_scanner.py
File python/kudu/tests/test_scanner.py:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/test_scanner.py@407
PS4, Line 407: 
> Should add TODO to improve these tests to test functionality without requir
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/util.py
File python/kudu/tests/util.py:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/util.py@71
PS4, Line 71:                 # use precision>18 to ensure we use DECIMAL128.
> NB: someone is working on this right now but it might be 3-6 months before 
Thanks for calling that out, I mentioned ARROW-9404 here. Kudu doesn't current support the promotion. Bitwidth is handled automatically in Kudu based on precision and scale, and is hidden from end users. I'm a bit hesitant to start exposing that to users, so I'll punt on it for now (filed KUDU-3203 for it).


http://gerrit.cloudera.org:8080/#/c/15661/4/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

http://gerrit.cloudera.org:8080/#/c/15661/4/src/kudu/client/client-test.cc@1131
PS4, Line 1131:     if (as.release) {
> unchecked status?
Done


http://gerrit.cloudera.org:8080/#/c/15661/4/src/kudu/client/client-test.cc@1145
PS4, Line 1145:         b.release(&b);
> warning: invalid case style for global constant 'read_modes' [readability-i
Ack



-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Mon, 19 Oct 2020 22:12:11 +0000
Gerrit-HasComments: Yes

[kudu-CR] python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has uploaded a new patch set (#4) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: python: support for passing arrow data to python
......................................................................

python: support for passing arrow data to python

This adds some support to pass arrow-formatted data to pyarrow. This
takes the form of Arrow's C-structs, as well as helper classes to manage
their memory. These structs are exposed in the Cython wrapper files.

This also adds Python APIs that match the existing Python APIs. Namely,
batch-level APIs that return one scan batch at a time, as well as a
to_pandas() method that materializes the entire scan result as a
DataFrame.

As a sanity check for performance, I wrote a simple script (posted as a
gist[1]) to write a bunch of rows and then serialize them to Pandas
DataFrames:

Materialized tuple DataFrame with 100000 rows in 0.7903690928 secs, averaged across 5 runs
Materialized arrow DataFrame with 100000 rows in 0.05492771920000004 secs, averaged across 5 runs

[1] https://gist.github.com/andrwng/bae4a4696eabb501b3b69f8db263a745

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/__init__.py
M python/kudu/client.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M python/kudu/tests/test_scanner.py
M python/kudu/tests/util.py
M src/kudu/client/CMakeLists.txt
A src/kudu/client/arrow-internal.h
A src/kudu/client/arrow.h
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
19 files changed, 539 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/4
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................


Patch Set 5:

> Patch Set 5:
> 
> Sorry for dropping the ball on moving along this review. Would you like to rebase this and we can try to get it merged?

No worries, I tabled the work a bit myself, but I'd be happy to rebase this later this week.


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Fri, 25 Jun 2021 00:38:07 +0000
Gerrit-HasComments: No

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has uploaded a new patch set (#5) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................

wip python: support for passing arrow data to python

wip: some TODOs that might be worth addressing:
- adding Arrow tests that don't use Pandas
- understand nogil better
- some ugly code around empty scan batches -- could probably be cleaned
  up, but also wouldn't be against merging as is

This adds some support to pass arrow-formatted data to pyarrow. This
takes the form of Arrow's C-structs, as well as helper classes to manage
their memory. These structs are exposed in the Cython wrapper files.

This also adds Python APIs that match the existing Python APIs. Namely,
batch-level APIs that return one scan batch at a time, as well as a
to_pandas() method that materializes the entire scan result as a
DataFrame.

As a sanity check for performance, I wrote a simple script (posted as a
gist[1]) to write a bunch of rows and then serialize them to Pandas
DataFrames:

Materialized tuple DataFrame with 100000 rows in 0.7903690928 secs, averaged across 5 runs
Materialized arrow DataFrame with 100000 rows in 0.05492771920000004 secs, averaged across 5 runs

[1] https://gist.github.com/andrwng/bae4a4696eabb501b3b69f8db263a745

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/__init__.py
M python/kudu/client.pyx
M python/kudu/errors.pxd
M python/kudu/errors.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M python/kudu/tests/test_scanner.py
M python/kudu/tests/util.py
M src/kudu/client/CMakeLists.txt
A src/kudu/client/arrow-internal.h
A src/kudu/client/arrow.h
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
21 files changed, 594 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/5
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>

[kudu-CR] WIP: support for passing arrow data to python

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Grant Henke, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15661

to look at the new patch set (#2).

Change subject: WIP: support for passing arrow data to python
......................................................................

WIP: support for passing arrow data to python

This adds some support to pass arrow-formatted data to pyarrow.

WIP because it needs some cleanup, and also need to figure out how to
test this, given it depends on pyarrow nightly builds. The required
feature should be in the next pyarrow release due in a "couple weeks" so
will probably hold off until then.

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/client.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M src/kudu/client/CMakeLists.txt
A src/kudu/client/arrow-internal.h
A src/kudu/client/arrow.h
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
16 files changed, 399 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/2
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] WIP: support for passing arrow data to python

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: WIP: support for passing arrow data to python
......................................................................


Patch Set 2:

Thanks Wes! Feel free to reach out on Slack or the mailing list if we can help.


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Sun, 17 May 2020 15:55:49 +0000
Gerrit-HasComments: No

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................


Patch Set 5:

> Patch Set 5:
> 
> > Patch Set 5:
> > 
> > Sorry for dropping the ball on moving along this review. Would you like to rebase this and we can try to get it merged?
> 
> No worries, I tabled the work a bit myself, but I'd be happy to rebase this later this week.

I'm taking the next few days off so I probably won't get to this until next week.

That said, I started looking into handling empty columnar scan batches and I don't see anything specific to empty ArrowArrays in the C data interface spec. Just wanted to check to make sure I'm not missing it: do we need to set pointers for the children of an ArrowArray even if the length is 0?


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Sun, 27 Jun 2021 06:33:39 +0000
Gerrit-HasComments: No

[kudu-CR] WIP: support for passing arrow data to python

Posted by "Wes McKinney (Code Review)" <ge...@cloudera.org>.
Wes McKinney has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: WIP: support for passing arrow data to python
......................................................................


Patch Set 2:

I'd be interested in helping get this across the finish line. I haven't built Kudu in a while so once I set up the toolchain to build I can take a look at this


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Sun, 17 May 2020 01:45:55 +0000
Gerrit-HasComments: No

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Wes McKinney (Code Review)" <ge...@cloudera.org>.
Wes McKinney has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................


Patch Set 5:

> That said, I started looking into handling empty columnar scan batches and I don't see anything specific to empty ArrowArrays in the C data interface spec. Just wanted to check to make sure I'm not missing it: do we need to set pointers for the children of an ArrowArray even if the length is 0?

If there would be children when the array has non-zero length, yes, I am pretty sure they need to be set, too. This might be something that could be raised in Arrow as a possible relaxation / simplification


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Tue, 29 Jun 2021 09:35:24 +0000
Gerrit-HasComments: No

[kudu-CR] WIP: support for passing arrow data to python

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: WIP: support for passing arrow data to python
......................................................................


Patch Set 2:

(2 comments)

Just scanned this given it's a WIP

http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc
File src/kudu/client/schema.cc:

http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc@928
PS2, Line 928: const char* ArrowFormatForType(DataType type) {
It could be worth linking to the documentation on this:
https://github.com/apache/arrow/blob/master/docs/source/format/CDataInterface.rst


http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc@941
PS2, Line 941:       // TODO(todd) support missing types
I think we can support DATE with `tdD `:
https://github.com/apache/arrow/blob/96a252290714000c3c9136955c850e2c287e4a51/docs/source/format/CDataInterface.rst#data-type-description----format-strings

I think our decimal could work too. The format is d:<precision>,<scale>. Example: d:19,10



-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Tue, 07 Apr 2020 13:55:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has uploaded a new patch set (#3) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: python: support for passing arrow data to python
......................................................................

python: support for passing arrow data to python

This adds some support to pass arrow-formatted data to pyarrow. This
takes the form of Arrow's C-structs, as well as helper classes to manage
their memory. These structs are exposed in the Cython wrapper files.

This also adds Python APIs that match the existing Python APIs. Namely,
batch-level APIs that return one scan batch at a time, as well as a
to_pandas() method that materializes the entire scan result as a
DataFrame.

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/__init__.py
M python/kudu/client.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M python/kudu/tests/test_scanner.py
M python/kudu/tests/util.py
M src/kudu/client/CMakeLists.txt
A src/kudu/client/arrow-internal.h
A src/kudu/client/arrow.h
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
19 files changed, 540 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/3
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>

[kudu-CR] python: support for passing arrow data to python

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: python: support for passing arrow data to python
......................................................................


Patch Set 3:

(2 comments)

Had some time this weekend so I rebased this and added some tests.

http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc
File src/kudu/client/schema.cc:

http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc@928
PS2, Line 928: // Returns a c-string for the column type, using 'arena' to store
> It could be worth linking to the documentation on this:
Done


http://gerrit.cloudera.org:8080/#/c/15661/2/src/kudu/client/schema.cc@941
PS2, Line 941:     case FLOAT: return "f";
> I think we can support DATE with `tdD `:
Done

Noteworthy that Arrow doesn't support DECIMAL32 and DECIMAL64. Might be worth allowing coercion to more granular decimals (i.e. from 32 to 64, 64 to 128, etc).



-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 3
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Sun, 18 Oct 2020 03:54:21 +0000
Gerrit-HasComments: Yes

[kudu-CR] python: support for passing arrow data to python

Posted by "Wes McKinney (Code Review)" <ge...@cloudera.org>.
Wes McKinney has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: python: support for passing arrow data to python
......................................................................


Patch Set 4:

(10 comments)

Thanks for updating this -- I left a few comments mostly about the pyarrow/Cython portions

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx
File python/kudu/client.pyx:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@60
PS4, Line 60:     CLIENT_SUPPORTS_ARROW = False
A version check might not be a bad idea since the C interface didn't appear in pyarrow until ~0.17.0


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2089
PS4, Line 2089:         check_status(self.scanner.NextBatch(&batch))
Recommend `with nogil:` block here


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2091
PS4, Line 2091:             return _empty_batch()
In principle this 0-row case shouldn't be necessary?


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2092
PS4, Line 2092:         check_status(batch.ToArrow(&arrow))
with nogil:


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2224
PS4, Line 2224:         if self.columnar_enabled:
Can we have a `to_arrow()` method, too? Then `to_pandas()` is just `self.to_arrow().to_pandas()`


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/client.pyx@2230
PS4, Line 2230:                     if arrow_batch and arrow_batch.num_rows != 0 ]
This formatting is a bit odd, generally we might see something like

[arrow_batch
 for arrow_batch in self.xbatches_arrow()
 if arrow_batch and arrow_batch.num_rows != 0]


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/schema.pyx
File python/kudu/schema.pyx:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/schema.pyx@785
PS4, Line 785:         check_status(self.schema.ToArrowSchema(&arrow_schema))
with nogil:


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/test_scanner.py
File python/kudu/tests/test_scanner.py:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/test_scanner.py@407
PS4, Line 407: 
Should add TODO to improve these tests to test functionality without requiring pandas (which may in turn require implementing `to_arrow()` when columnar layout is disabled, so more work required)


http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/util.py
File python/kudu/tests/util.py:

http://gerrit.cloudera.org:8080/#/c/15661/4/python/kudu/tests/util.py@71
PS4, Line 71:                 # use precision>18 to ensure we use DECIMAL128.
NB: someone is working on this right now but it might be 3-6 months before it could get picked up by Kudu. Is there a way to promote decimal32/64 to decimal128 in the meantime?


http://gerrit.cloudera.org:8080/#/c/15661/4/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

http://gerrit.cloudera.org:8080/#/c/15661/4/src/kudu/client/client-test.cc@1131
PS4, Line 1131:   scanner.GetProjectionSchema().ToArrowSchema(&as);
unchecked status?



-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 4
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Sun, 18 Oct 2020 23:04:31 +0000
Gerrit-HasComments: Yes

[kudu-CR] wip python: support for passing arrow data to python

Posted by "Wes McKinney (Code Review)" <ge...@cloudera.org>.
Wes McKinney has posted comments on this change. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................


Patch Set 5:

Sorry for dropping the ball on moving along this review. Would you like to rebase this and we can try to get it merged?


-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>
Gerrit-Comment-Date: Wed, 23 Jun 2021 21:45:42 +0000
Gerrit-HasComments: No