Posted to jira@arrow.apache.org by "Eduardo Ponce (Jira)" <ji...@apache.org> on 2021/12/27 07:13:00 UTC

[jira] [Comment Edited] (ARROW-11989) [C++][Python] Improve ChunkedArray's complexity for the access of elements

    [ https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17465561#comment-17465561 ] 

Eduardo Ponce edited comment on ARROW-11989 at 12/27/21, 7:12 AM:
------------------------------------------------------------------

Is the structure of the example ChunkedArray common (i.e., many chunks of the same length)? If so, this access pattern would benefit from a "FixedSizeChunkedArray" (which does not exist yet), where all chunks have the same length and chunk access is therefore O(1). This can be implemented without defining a new class, simply by having the ChunkedArray keep a flag that tracks whether all chunks are the same length.
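A minimal sketch of the O(1) lookup such a flag would enable (illustrative names only, not Arrow API):

```python
def locate_equal_chunks(index, chunk_length):
    """Map a logical index to (chunk_index, offset_within_chunk) in O(1),
    valid only when every chunk has exactly `chunk_length` elements."""
    # A single division replaces the linear scan over chunk lengths.
    return divmod(index, chunk_length)

# e.g. with 1024-row chunks, logical index 5000 lands in chunk 4 at offset 904
```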

Now, with regard to using a binary search instead of a linear search to find the chunk of interest: I expect binary search to improve access times for high-value indices but worsen them for low-value indices, due to the overhead of the binary search itself. The overall access time depends on the application's access patterns; a binary search would make chunk lookup more consistent overall, but it still has drawbacks in certain cases.

_Food for thought:_ An alternative solution is to let client code specify the direction of the linear search. This would help control performance based on the expected access patterns. The search direction could be specified as an object attribute or a function parameter.
* *forward* - begins the search at the first chunk; useful for low-value indices
* *backward* - begins the search at the last chunk; useful for high-value indices
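A sketch of such a direction-aware linear search (hypothetical helper in Python, not Arrow code):

```python
def locate_linear(chunk_lengths, index, direction="forward"):
    """Linear scan for the chunk containing logical `index`.

    direction='forward' starts at the first chunk (cheap for low indices);
    direction='backward' starts at the last chunk (cheap for high indices).
    Returns (chunk_index, offset_within_chunk).
    """
    if direction == "forward":
        start = 0  # logical index of the first element of the current chunk
        for chunk, length in enumerate(chunk_lengths):
            if index < start + length:
                return chunk, index - start
            start += length
    else:
        start = sum(chunk_lengths)
        for chunk in range(len(chunk_lengths) - 1, -1, -1):
            start -= chunk_lengths[chunk]
            if index >= start:
                return chunk, index - start
    raise IndexError(index)
```

Both directions return the same answer; only the number of iterations differs.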



> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --------------------------------------------------------------------------
>
>                 Key: ARROW-11989
>                 URL: https://issues.apache.org/jira/browse/ARROW-11989
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>    Affects Versions: 3.0.0
>            Reporter: quentin lhoest
>            Priority: Major
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for anything better than O(chunk) to access an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the average times to access one example are:
> {code}
> Time to access example at i=0%	: 6.7μs
> Time to access example at i=10%	: 7.2μs
> Time to access example at i=20%	: 9.1μs
> Time to access example at i=30%	: 11.4μs
> Time to access example at i=40%	: 13.8μs
> Time to access example at i=50%	: 16.2μs
> Time to access example at i=60%	: 18.7μs
> Time to access example at i=70%	: 21.1μs
> Time to access example at i=80%	: 26.8μs
> Time to access example at i=90%	: 25.2μs
> {code}
> The measured times are the averages of doing `table["text"][j]`, depending on the index fetched (from the first example at 0% of the table length to the example at 90%).
> You can take a look at the code that produces this benchmark [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E] suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since a contiguous array makes iteration over the chunk lengths faster;
> - use a binary search, as in the Julia implementation [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work!
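For reference, the binary-search approach mentioned in the issue can be sketched as a bisection over cumulative chunk offsets (illustrative Python, not the Arrow implementation; in practice the offsets would be precomputed once per ChunkedArray):

```python
import bisect
from itertools import accumulate

def locate_bisect(chunk_lengths, index):
    """Binary search over cumulative chunk offsets: O(log n_chunks)."""
    # offsets[k] = total number of elements in chunks 0..k
    offsets = list(accumulate(chunk_lengths))
    if not 0 <= index < offsets[-1]:
        raise IndexError(index)
    # First chunk whose cumulative end exceeds `index`.
    chunk = bisect.bisect_right(offsets, index)
    start = offsets[chunk - 1] if chunk else 0
    return chunk, index - start
```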



--
This message was sent by Atlassian Jira
(v8.20.1#820001)