Posted to jira@arrow.apache.org by "Scott Wilson (Jira)" <ji...@apache.org> on 2020/09/08 20:42:00 UTC

[jira] [Updated] (ARROW-8199) [C++] Guidance for creating multi-column sort on Table example?

     [ https://issues.apache.org/jira/browse/ARROW-8199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Wilson updated ARROW-8199:
--------------------------------
    Attachment: DataFrame.h

Hey Wes,

I hope you and yours are doing well in this strange time.

I'm just writing to thank you for all the work you did on Arrow and the
various discussions you've posted about the design decisions that drove
this development, post-pandas. I've largely completed my C++ DataFrame and
replaced the Python/pandas code that we use for our ML pipeline. Using the
Arrow framework, I've been able to create a DataFrame object that wraps one
or more arrow tables. The implementation supports no-copy subsets, joins
and concatenations, and stl-like iterators. Also supported are transforms
using in-place lambda functions. The net result is that a ~1 TB data
processing step that used to take 13 hours now takes 15 minutes.

The only kluge I put into place has to do with support for null values. I
allow in-place editing of values, but no changes to array sizes or types.
This is possible because the typed arrays offer access to the underlying
raw values. To offer the same for null values I had to create derived
classes for Array and ChunkedArray that offer access to the cached null_counts.
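
To illustrate why in-place null edits need access to the cached count, here
is a minimal standard-library sketch. It is not the attached DataFrame.h and
does not use Arrow's classes: NullableColumn and its members are hypothetical
names, and the only assumption carried over from Arrow is a bit-packed
validity bitmap (1 = valid, least significant bit first) alongside a cached
null count.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a typed column; NOT Arrow's Array class.
struct NullableColumn {
  std::vector<double> values;        // raw values, editable in place
  std::vector<std::uint8_t> valid;   // validity bitmap: 1 bit per slot, LSB first
  std::size_t null_count = 0;        // cached count; must track the bitmap

  // Start with n slots, all valid and zero-initialized.
  explicit NullableColumn(std::size_t n)
      : values(n, 0.0), valid((n + 7) / 8, 0xFF) {}

  bool IsValid(std::size_t i) const {
    assert(i < values.size());
    return ((valid[i / 8] >> (i % 8)) & 1u) != 0;
  }

  // Overwrite a slot with a value. Size and type never change.
  void Set(std::size_t i, double v) {
    if (!IsValid(i)) {
      valid[i / 8] = static_cast<std::uint8_t>(valid[i / 8] | (1u << (i % 8)));
      --null_count;  // slot went from null to valid
    }
    values[i] = v;
  }

  // Mark a slot as null. The stale raw value is left in place.
  void SetNull(std::size_t i) {
    if (IsValid(i)) {
      valid[i / 8] = static_cast<std::uint8_t>(valid[i / 8] & ~(1u << (i % 8)));
      ++null_count;  // slot went from valid to null
    }
  }
};
```

Flipping a validity bit without touching null_count would leave the cached
value stale, which is why both have to be writable together.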

I've attached the DataFrame header in case it's of interest.

Thanks again, Scott




-- 
Scott B. Wilson
Chairman and Chief Scientist
Persyst Development Corporation
420 Stevens Avenue, Suite 210
Solana Beach, CA 92075


> [C++] Guidance for creating multi-column sort on Table example?
> ---------------------------------------------------------------
>
>                 Key: ARROW-8199
>                 URL: https://issues.apache.org/jira/browse/ARROW-8199
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: C++
>    Affects Versions: 0.16.0
>            Reporter: Scott Wilson
>            Priority: Minor
>              Labels: c++, newbie
>         Attachments: ArrowCsv.cpp, DataFrame.h
>
>
> I'm just coming up to speed with Arrow and am noticing a dearth of examples ... maybe I can help here.
> I'd like to implement multi-column sorting for Tables and just want to ensure that I'm not duplicating existing work or proposing a bad design.
> My thought was to create a Table-specific version of SortToIndices() where you can specify the columns and sort order.
> Then I'd create Array "views" that use the indices to remap from the original Array values to the values in sorted order. (The original data is not sorted, but could be as a second step.) I noticed that some of the array list variants keep offsets, but I didn't see anything that supports remapping through a list of indices. This may just be my oversight.
> Thanks in advance, Scott
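
The design described in the issue (sort to indices over several columns, then
read columns through those indices) can be sketched with the standard library
alone. This is an illustration under stated assumptions, not Arrow's API:
Column, SortKey, SortTableToIndices and IndexedView are hypothetical names,
columns are plain vectors of double, and nulls and NaN values are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical column type for this sketch; a real table would be typed.
using Column = std::vector<double>;

enum class Order { Ascending, Descending };

// One sort key: which column to compare and in which direction.
struct SortKey {
  std::size_t column;
  Order order;
};

// Returns row indices in sorted order. The table itself is not modified.
// Keys are compared in the order given; later keys break ties.
std::vector<std::size_t> SortTableToIndices(const std::vector<Column>& table,
                                            const std::vector<SortKey>& keys) {
  const std::size_t rows = table.empty() ? 0 : table[0].size();
  for (const Column& c : table) assert(c.size() == rows);

  std::vector<std::size_t> indices(rows);
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // stable_sort keeps the original row order for rows equal on every key.
  std::stable_sort(indices.begin(), indices.end(),
                   [&](std::size_t a, std::size_t b) {
                     for (const SortKey& k : keys) {
                       const Column& col = table[k.column];
                       if (col[a] == col[b]) continue;  // tie: try next key
                       return k.order == Order::Ascending ? col[a] < col[b]
                                                          : col[a] > col[b];
                     }
                     return false;  // equal on all keys
                   });
  return indices;
}

// Read-only "view" that remaps positions through an index list.
// It copies no values; source and indices must outlive the view.
class IndexedView {
 public:
  IndexedView(const Column& source, const std::vector<std::size_t>& indices)
      : source_(&source), indices_(&indices) {}

  std::size_t size() const { return indices_->size(); }
  double operator[](std::size_t i) const { return (*source_)[(*indices_)[i]]; }

 private:
  const Column* source_;
  const std::vector<std::size_t>* indices_;
};
```

One index vector can back a view over every column of the table, so the
sorted order is computed once and the original data is left untouched.
Materializing sorted copies then becomes an optional second step.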


