Posted to issues@arrow.apache.org by "quentin lhoest (Jira)" <ji...@apache.org> on 2021/03/16 18:28:00 UTC

[jira] [Created] (ARROW-11989) [Python] Improve ChunkedArray's complexity for the access of elements

quentin lhoest created ARROW-11989:
--------------------------------------

             Summary: [Python] Improve ChunkedArray's complexity for the access of elements
                 Key: ARROW-11989
                 URL: https://issues.apache.org/jira/browse/ARROW-11989
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 3.0.0
            Reporter: quentin lhoest


Chunked arrays are stored as a C++ vector of Arrays.

There is currently no indexing structure on top of this vector, so accessing an arbitrary element is O(num_chunks) rather than anything better.
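For illustration, the lookup effectively amounts to a linear scan over the chunks. Here is a minimal Python-level sketch of that behaviour (the helper name is made up; the real code lives in the C++ layer):

{code:python}
import pyarrow as pa

def get_element_linear(chunked_array: pa.ChunkedArray, i: int):
    # Walk the chunks one by one until the one containing index i is found:
    # worst case O(num_chunks) iterations before the element can be returned.
    for chunk in chunked_array.iterchunks():
        if i < len(chunk):
            return chunk[i]
        i -= len(chunk)
    raise IndexError("index out of range")
{code}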

For example, with a Table consisting of 1 column “text” defined by:
- 1024 chunks
- each chunk is 1024 rows
- each row is a text of 1024 characters

Then the times it takes to access one example are:

{code:java}
Time to access example at i=0%	: 6.7μs
Time to access example at i=10%	: 7.2μs
Time to access example at i=20%	: 9.1μs
Time to access example at i=30%	: 11.4μs
Time to access example at i=40%	: 13.8μs
Time to access example at i=50%	: 16.2μs
Time to access example at i=60%	: 18.7μs
Time to access example at i=70%	: 21.1μs
Time to access example at i=80%	: 26.8μs
Time to access example at i=90%	: 25.2μs
{code}


The times measured are the average times to do `table["text"][j]` depending on the index we want to fetch (from the first example at 0% of the table length to the example at 90%).

You can take a look at the code that produces this benchmark [here|https://pastebin.com/pSkYHQn9].
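For reference, a minimal sketch of a comparable benchmark (not the exact pastebin code) could look like this; it builds the 1024 x 1024 x 1024 table described above, which allocates roughly 1 GB of strings:

{code:python}
import time
import pyarrow as pa

# 1024 chunks of 1024 rows, each row a text of 1024 characters.
chunks = [pa.array(["x" * 1024] * 1024) for _ in range(1024)]
table = pa.table({"text": pa.chunked_array(chunks)})

n = len(table)
for pct in range(0, 100, 10):
    j = n * pct // 100
    start = time.perf_counter()
    for _ in range(1000):
        table["text"][j]
    elapsed_us = (time.perf_counter() - start) / 1000 * 1e6
    print(f"Time to access example at i={pct}%\t: {elapsed_us:.1f}us")
{code}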

Some discussions in [this thread on the mailing list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E] suggested different approaches to improve the complexity:
- use a contiguous array of chunk lengths, since iterating over a contiguous array of lengths is faster;
- use a binary search over the cumulative chunk lengths, as in the Julia implementation [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14] (a rough sketch of this approach follows after this list);
- use interpolation search.
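To make the binary-search idea concrete, here is a rough Python sketch over precomputed cumulative chunk lengths. The class name is made up, and the actual fix would belong in the C++ ChunkedArray rather than in a Python wrapper:

{code:python}
import bisect
import itertools
import pyarrow as pa

class IndexedChunkedArray:
    """Element lookup in O(log num_chunks) via cumulative chunk lengths."""

    def __init__(self, chunked_array: pa.ChunkedArray):
        self.chunked_array = chunked_array
        # offsets[k] = total number of rows in chunks 0..k (inclusive).
        self.offsets = list(itertools.accumulate(
            len(c) for c in chunked_array.iterchunks()))

    def __getitem__(self, i: int):
        # Binary search for the first chunk whose cumulative length exceeds i.
        k = bisect.bisect_right(self.offsets, i)
        start = self.offsets[k - 1] if k > 0 else 0
        return self.chunked_array.chunk(k)[i - start]
{code}

With an index structure of this kind, the access time should stay roughly flat no matter where in the table the element lives.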

Apparently there is also a lookup structure in the compute layer [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].

cc [~emkornfield], [~wesm]

Thanks again for the amazing work!




