Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2020/10/19 21:24:51 UTC

[kudu-CR] wip python: support for passing arrow data to python

Andrew Wong has uploaded a new patch set (#5) to the change originally created by Todd Lipcon. ( http://gerrit.cloudera.org:8080/15661 )

Change subject: wip python: support for passing arrow data to python
......................................................................

wip python: support for passing arrow data to python

WIP: some TODOs that might be worth addressing:
- add Arrow tests that don't use pandas
- understand nogil better
- clean up some ugly code around empty scan batches (though I also
  wouldn't be against merging as is)

This adds some support for passing Arrow-formatted data to pyarrow. The
data takes the form of Arrow's C structs, with helper classes to manage
their memory. The structs are exposed in the Cython wrapper files.

This also adds Python APIs that mirror the existing ones: batch-level
APIs that return one scan batch at a time, and a to_pandas() method
that materializes the entire scan result as a DataFrame.
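
As a rough illustration of how a whole-result call can sit on top of a
batch-level one, here is a minimal pure-Python sketch. ToyScanner,
iter_batches() and read_all() are hypothetical names made up for this
example and are not part of the Kudu client; plain dicts of lists stand
in for Arrow data, and one batch is deliberately empty.

```python
from typing import Dict, Iterator, List

# One columnar scan batch: column name -> values for that batch.
Batch = Dict[str, list]


class ToyScanner:
    """Hypothetical stand-in for a scanner; not the Kudu client API."""

    def __init__(self, batches: List[Batch]):
        self._batches = batches

    def iter_batches(self) -> Iterator[Batch]:
        # Batch-level API: hand back one columnar batch at a time.
        for batch in self._batches:
            yield batch

    def read_all(self) -> Batch:
        # Whole-result API: concatenate every batch column by column.
        result: Batch = {}
        for batch in self.iter_batches():
            for name, values in batch.items():
                result.setdefault(name, []).extend(values)
        return result


scanner = ToyScanner([
    {"key": [1, 2], "val": ["a", "b"]},
    {"key": [], "val": []},  # an empty batch must not break the merge
    {"key": [3], "val": ["c"]},
])
combined = scanner.read_all()
```

Because every batch carries the same column names, an empty batch
contributes nothing but still keeps the result's columns well defined.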

As a sanity check for performance, I wrote a simple script (posted as a
gist[1]) that writes a bunch of rows and then materializes them as
pandas DataFrames:

Materialized tuple DataFrame with 100000 rows in 0.7903690928 secs, averaged across 5 runs
Materialized arrow DataFrame with 100000 rows in 0.05492771920000004 secs, averaged across 5 runs

In this run, the Arrow path was roughly 14x faster than the tuple path.

[1] https://gist.github.com/andrwng/bae4a4696eabb501b3b69f8db263a745
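
For readers who want to reproduce the "averaged across 5 runs" style of
measurement, a minimal timing helper might look like the sketch below.
This is not the gist's code: average_seconds() and build_rows() are
hypothetical names, and the workload is a placeholder rather than a
Kudu scan.

```python
import time
from typing import Callable


def average_seconds(fn: Callable[[], object], runs: int = 5) -> float:
    """Return the mean wall-clock time of calling fn() over `runs` calls."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs


def build_rows(n: int = 100_000) -> list:
    # Placeholder workload (assumption): build a list of row tuples.
    # The real script materializes a DataFrame from a scan instead.
    return [(i, str(i)) for i in range(n)]


elapsed = average_seconds(build_rows, runs=5)
print(f"Built 100000 rows in {elapsed} secs, averaged across 5 runs")
```

Averaging over several runs smooths out one-off noise such as a cold
cache, though it says nothing about variance between runs.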

Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
---
M python/kudu/__init__.py
M python/kudu/client.pyx
M python/kudu/errors.pxd
M python/kudu/errors.pyx
M python/kudu/libkudu_client.pxd
M python/kudu/schema.pyx
M python/kudu/tests/test_scanner.py
M python/kudu/tests/util.py
M src/kudu/client/CMakeLists.txt
A src/kudu/client/arrow-internal.h
A src/kudu/client/arrow.h
M src/kudu/client/client-test.cc
M src/kudu/client/client.cc
M src/kudu/client/client.h
M src/kudu/client/columnar_scan_batch.cc
M src/kudu/client/columnar_scan_batch.h
M src/kudu/client/scanner-internal.cc
M src/kudu/client/scanner-internal.h
M src/kudu/client/schema.cc
M src/kudu/client/schema.h
M src/kudu/util/memory/arena.h
21 files changed, 594 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/61/15661/5
-- 
To view, visit http://gerrit.cloudera.org:8080/15661
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5df4714bf5ac339d675f2a9169f8aeff06d30eed
Gerrit-Change-Number: 15661
Gerrit-PatchSet: 5
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Wes McKinney <we...@apache.org>