You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "anjakefala (via GitHub)" <gi...@apache.org> on 2023/03/11 01:26:44 UTC

[GitHub] [arrow] anjakefala opened a new issue, #34535: Request to move ChunkResolver to public API

anjakefala opened a new issue, #34535:
URL: https://github.com/apache/arrow/issues/34535

   ### Describe the enhancement requested
   
   [ChunkResolver.Resolve](https://github.com/apache/arrow/blob/45918a90a6ca1cf3fd67c256a7d6a240249e555a/cpp/src/arrow/chunk_resolver.h#L35) is used by `ChunkedArray.getStatic()` in order to identify which chunk, and which index into that chunk, correlate with a given index into the whole array. `getStatic` then does additional work to wrap the value into a  `Result<std::shared_ptr<Scalar>>`. The creation and return of a `shared_ptr` does result in some performance overhead, that makes a difference for a performance-sensitive application.
   
   If someone could use `ChunkResolver` to learn the indices, they could then instead access the data directly. This would provide a more efficient way of indexing into a `ChunkedArray`, should an application benefit from it.
   
   Would it be possible to move `ChunkResolver` to public API? 
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965495787

   > Now it becomes impossible to have a non-inline function [1] in the ChunkResolver because the entire class is a template. That's why these abstractions are separate. Care was taken to allow ChunkResolver to work for any type by making it deal with chunks in terms of lengths and offsets.
   
   I see, appreciate you taking the time to explain! I'll have to read up on this because it conflicts with my general understanding of the `inline` keyword. To clarify my current understanding - based on [cppreference](https://en.cppreference.com/w/cpp/language/inline) and other info like in [this SO answer](https://stackoverflow.com/a/1759575), my understanding is that the `inline` keyword purely controls the linkage of the symbol, and whether the function should actually be _inlined_ is done at the compiler's discretion (regardless of whether the function is marked `inline`). Based on this, my understanding was that methods in a class template could still be inlined if the compiler deemed it appropriate.
   
   Nevertheless, I'll take your word for it that this is not possible as I assume you have more experience on this matter; I'd love to read any materials you might have that delve further into this!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "anjakefala (via GitHub)" <gi...@apache.org>.
anjakefala commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1927686962

   Oh thanks for letting me know @SChakravorti21!
   
   It absolutely would be. @felipecrv has recently been putting love into the `ChunkResolver` documentation, so that step might actually be shorter: https://github.com/apache/arrow/issues/39815.
   
   Please give us a poke here when you have a PR open. =)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965371836

   > For instance, when sorting a chunked array, we don't want to keep copying that same set of shared_ptrs over and over.
   
   Sorry if I wasn't clear - the `ChunkedArrayResolver` returns raw pointers to `Array`s, not shared pointers, so it avoids the atomic reference counting overhead. Do you feel like it would be acceptable to do it this way in `ChunkResolver`? I was thinking we could do something like this (pseudo-code):
   
   ```cpp
   /// BEGIN: Copied from `chunked_internal.h`
   
   template <typename ArrayType>
   struct ResolvedChunk {
     using ViewType = GetViewType<typename ArrayType::TypeClass>;
     using LogicalValueType = typename ViewType::T;
   
     const ArrayType* array;
     const int64_t index;
   
     ResolvedChunk(const ArrayType* array, int64_t index) : array(array), index(index) {}
   
     bool IsNull() const { return array->IsNull(index); }
   
     LogicalValueType Value() const { return ViewType::LogicalValue(array->GetView(index)); }
   };
   
   template <>
   struct ResolvedChunk<Array> {
     const Array* array;
     const int64_t index;
   
     ResolvedChunk(const Array* array, int64_t index) : array(array), index(index) {}
   
     bool IsNull() const { return array->IsNull(index); }
   };
   
   /// END: Copied from `chunked_internal.h`
   
   struct ChunkLocation {
     int64_t chunk_index;
     int64_t index_in_chunk;
   };
   
   template <typename ArrayType = Array>
   class ChunkResolver {
    public:
     inline ResolvedChunk<ArrayType> Resolve(int64_t index) const;
     inline ChunkLocation            ResolveLocation(int64_t index) const;
     inline ChunkLocation            ResolveLocationWithHint(ChunkLocation hint) const;
   
    private:
     std::vector<int64_t> offsets_;
     mutable std::atomic<int64_t> cached_chunk_;
   
     // This contains `shared_ptr`s to simplify usage for some applications (to avoid
     // having to hold onto the original Arrays), but we would only return raw pointers
     // to avoid the reference counting overhead of copying `shared_ptr`s.
     std::vector<std::shared_ptr<Array>> chunks_;
   };
   ```
   
   I agree that maintaining the performance of `ChunkResolver` is absolutely critical, and I think this solution avoids adding any unnecessary overhead for people who choose to use `ResolveLocation` and `ResolveLocationWithHint`.
   
   > That and the fact that you would have to type the functions based on all the types that are used to represent an array of array chunks.
   
   I'm sorry, I'm not quite sure I understood this part. Could you clarify?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965431882

   >> That and the fact that you would have to type the functions based on all the types that are used to represent an array of >> array chunks.
   >
   > I'm sorry, I'm not quite sure I understood this part. Could you clarify?
   
   The resolution depends on what chunked thing you're targetting:
   
   ```cpp
     explicit ChunkResolver(const ArrayVector& chunks);
     explicit ChunkResolver(const std::vector<const Array*>& chunks);
     explicit ChunkResolver(const RecordBatchVector& batches);
   ```
   
   This is the problem (code from your proposal):
   
   ```cpp
   template <typename ArrayType = Array>
   class ChunkResolver {
   ```
   
   Now it becomes impossible to have a non-inline function [1] in the `ChunkResolver` because the entire class is a template. That's why these abstractions are separate. Care was taken to allow `ChunkResolver` to work for any type by making it deal with chunks in terms of lengths and offsets.
   
   [1] I have a plan to add batched/vectorized versions of `ChunkResolver::Resolve` with most of the code living in the (to be added) `chunk_resolver.cc`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1966755978

   >  towards compile times
   
   Not primarily, binary size is the primary concern. As it is now, we don't have to inline the `ChunkResolver` constructors, for instance.
   
   > In that case, we can have the Resolve method be templated rather than the entire class
   
   You can have crashes if the user passes the wrong type parameter to the method call and in your solution, you're adding `std::vector<std::shared_ptr<Array>> chunks_;` to the list of members.
   
   There is a way here based on **composition** that you will probably like. We can bring most of `chunked_internal.h` to here (or a different header that includes `chunk_resolver.h`) but improve on the design as it will now become public.
   
   I recommend you to do that **on a separate PR** as `chunked_internal.h` is much more dependent on other internal stuff.
   
   ```cpp
   template <typename ArrayType = std::shared_ptr<Array>>
   class ArrayChunkResolver {
    public:
     using ArrayPtr = /* template thingie that gives you the raw-est pointer type from `ArrayType` */;
   
    private:
     ChunkResolver _resolver; // owns a vector of chunk offsets and assumes chunk sizes are immutable
     std::vector<ArrayType> &chunks_;  // the resolver doesn't own the vector of chunks
   
    public:
     // ctor, move-ctor, and move-assign.
     
     inline ArrayPtr Resolve(int64_t index) const;
     inline ChunkLocation            ResolveLocation(int64_t index) const; // delegate to _resolver
     inline ChunkLocation            ResolveLocationWithHint(ChunkLocation hint) const; // delegate to _resolver
   };
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1969516244

   First step: https://github.com/apache/arrow/pull/40281
   
   In a following step, I might add a template param to `class ChunkedArrayResolver`, but it would be to accept `Array`, `ArrayData`, `RecordBatch`... and not value-type specific arrays like `Int32Array`, `BoolArray`, `StringArray`... these were the types that were being passed to `ChunkedArrayResolver::Resolve` and shouldn't really be. That's what the PR linked above changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977304538

   >  it might be good enough to make ChunkResolver/ChunkedArrayResolver public
   
   There is a case for making `ChunkResolver` public already. Please create a separate issue to talk about `ChunkedArrayResolver`.
   
   Without knowing what `do_business_logic` is, I can't say it's impossible to process column by column (and I don't mean with vector instructions, just memory locality effects alone being beneficial). To use columnar representation well, the business logic has to be aware of the columnar representation.
   
   But my main problem with your code (and let me be more direct this time with what I mean by "random access") is that you're using `Resolve`/`Value` (which is `O(log(num_chunks))`) on every iteration when you could be incrementing each `ChunkLocation` in O(1) without having to rely on the caching in `Resolve` to make "O(1) most of the time" + overhead.
   
   To that, we can add these two to `ChunkResolver`:
   
   ```cpp
     /// \pre loc.chunk_index >= 0
     /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
     inline bool Valid(ChunkLocation loc) const {
       const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
       return loc.chunk_index + 1 < last_chunk_index ||
              (loc.chunk_index + 1 == last_chunk_index &&
               loc.index_in_chunk < offsets_[last_chunk_index]);
     }
   
     /// \pre Valid(loc)
     inline ChunkLocation Next(ChunkLocation loc) const {
       const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
       return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
                  ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
                  : ChunkLocation{loc.chunk_index + 1, 0};
     }
   ```
   
   Then your loops can be:
   
   ```cpp
   ChunkResolver resolver(batches);
   for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
     // re-use loc for all the typed columns since they are split on the same offsets
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1927255323

   Hey @anjakefala and @westonpace! Is this something that would still be reasonable to work on? I'd like to work on this if so, and my plan would be:
   
   - [ ] Check unit test coverage and expand as necessary to handle edge cases
   - [ ] Move `ChunkResolver` out of the `internal` namespace
     - [ ] Refactor current usages of the class in Arrow C++ as necessary
   - [ ] Update the documentation
     - [ ] Update the API/reference docs as necessary
     - [ ] Add a section, perhaps in the User Guide or Examples, on how to use this class


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977858863

   ```cpp
   for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
       // what is the most efficient way to access the values for each column here?
       
       // I'm not sure this is the most efficient, but it's certainly better than using ChunkedArrayResolver for every array.
       // don't forget the null checks (missing here)
       int64_t a =  checked_cast<Int64Array>(chunks[loc.chunk_index]).Value(loc.index_in_chunk);
       std::string_view b = checked_cast<StringArray>(chunks[loc.chunk_index]).Value(loc.index_in_chunk);
       process_row_oh_no(a, b);
   }
   ```
   
   > I've been thinking about this and came up with an alternative way that may be better, and would be interested to hear your thoughts on it (pseudocode):
   
   Your approach works even better and doesn't need the `ChunkResolver`. What I've been trying to say is that you should never need the binary-search that `ChunkResolver` (and `ChunkedArrayResolver`) provides if you're looping SEQUENTIALLY over chunked arrays.
   
   > We can also document that these classes should never be used for pure sequential access, and something like above should be preferred instead. How does that sound?
   
   Well... documenting won't deter people from using it. See how much we had to discuss here so this nuance could be understood. The better plan is to implement things that need random access inside Arrow (sort, ranking, take, filter, joins...) and have users use that instead of reach to random access directly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1928490921

   @felipecrv Sounds like a reasonable change, I agree that it looks easier to use!
   
   Will let you folks know when I've opened a PR 👍🏼


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1971570218

   Hey @felipecrv! Sorry I have been unresponsive the last few days, was busy with work and couldn't find much free time outside of work.
   
   Everything you have said makes sense to me. I took a look at https://github.com/apache/arrow/pull/40281 and the changes there look great to me.
   
   > We can bring most of `chunked_internal.h` to here (or a different header that includes `chunk_resolver.h`) but improve on the design as it will now become public.
   
   So just to clarify, is the plan to move _both_ `ChunkResolver` and `ChunkedArrayResolver` into the public namespace? I think this would make sense - people can use `ChunkResolver` if they want low-level access to chunk locations, or they can use `ChunkedArrayResolver` if they want easy random access to values by logical index.
   
   The `ChunkedArrayResolver` changes look really nice for applications that need to apply some row-wise business logic to `ChunkedArray`s (which is the use-case I was targeting):
   
   ```cpp
   std::shared_ptr<arrow::Table> table;
   
   arrow::ChunkedArrayResolver resolver_a(table.GetColumnByName("a"));
   arrow::ChunkedArrayResolver resolver_b(table.GetColumnByName("b"));
   
   for (int i = 0; i < table->num_rows(); ++i)
   {
     std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
     std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
     do_business_logic(a, b);
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965500769

   On second read, I think your second concern might be more targeted towards compile times rather than performance, is that right?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #34535: Request to move ChunkResolver to public API

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1468665856

   If it's going to be a public API then I think it would be good to have some tests for it.  Otherwise, this seems like a reasonable utility to make available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1974945525

   > If it were to go public, that would be done in another issue/PR pair, so for now it's better to focus just on making `ChunkResolver` public.
   
   Gotcha. Yes, I agree, we can focus on just `ChunkResolver` for now!
   
   > [1] Apache Arrow, as a columnar data library, is built to keep most computation on top of columnar representation, you're of course allowed to do whatever you need to solver your application problem, but Arrow itself exposing row-by-row APIs would pass the wrong message.
   
   I agree 100% that we don't want to send the wrong message, and that people should always try to frame their logic in terms of vectorized operations. That said, there are practical use-cases where there is no way of getting around row-major processing of the data.
   
   I've been thinking about this and came up with an alternative way that may be better, and would be interested to hear your thoughts on it (pseudocode):
   
   ```cpp
   for (auto maybe_batch : arrow::TableBatchReader(*table))
   {
     std::shared_ptr<arrow::RecordBatch> batch = maybe_batch.ValueOrDie();
   
     // User decides whether they want a safe or unsafe cast
     std::shared_ptr<arrow::Int64Array> a = std::dynamic_pointer_cast<arrow::Int64Array>(batch->GetColumnByName("a"));
     std::shared_ptr<arrow::StringArray> b = std::dynamic_pointer_cast<arrow::StringArray>(batch->GetColumnByName("b"));
   
     for (int i = 0; i < batch->num_rows(); ++i) {
       do_business_logic(a.Value(i), b.Value(i));
     }
   }
   ```
   
   This is still sequential but (I think) avoids a lot of unnecessary overhead. In that case, it might be good enough to make `ChunkResolver`/`ChunkedArrayResolver` public only for the sake of use-cases that require random access into `ChunkedArray`s. We can also document that these classes should never be used for pure sequential access, and something like above should be preferred instead. How does that sound?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1965023476

   > ...if we should try to fold ChunkedArrayResolver's behavior into ChunkResolver somehow. Do you have any opinions or concerns about this?
   
   Hmm, it's better to keep `ChunkResolver` like this for now because it's meant as a low-level utility that you use in kernels. For instance, when sorting a chunked array, we don't want to keep copying that same set of shared_ptrs over and over.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1968974101

   @SChakravorti21 scratch that. I had a better idea (even simpler) and will be pushing a PR soon. I have the code ready on my machine.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1927733262

   @SChakravorti21 I had an idea since my changes to the class interface that you could add maybe?
   
   I added
    
   ```cpp
     inline ChunkLocation ResolveWithChunkIndexHint(int64_t index,
                                                    int64_t cached_chunk_index) const {
   ```
   
   I was planning to add a `ResolveWithHint(int64_t index, ChunkLocation hint)` -- simpler name and less error-prone. Its definition is:
   
   ```cpp
   
     inline ChunkLocation ResolveWithHint(int64_t index, ChunkLocation hint) const {
       assert(hint.chunk_index < static_cast<int64_t>(offsets_.size()));
       const auto chunk_index =
           ResolveChunkIndex</*StoreCachedChunk=*/false>(index, hint.chunk_index);
       return {chunk_index, index - offsets_[chunk_index]};
     }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1962806793

   Hey folks, just wanted to give a heads up that I've created a draft PR at https://github.com/apache/arrow/pull/40226 for the work on this so far.
   
   I have a question: I noticed that there is also a [`ChunkedArrayResolver`](https://github.com/apache/arrow/blob/b8fff043c6cb351b1fad87fa0eeaf8dbc550e37c/cpp/src/arrow/compute/kernels/chunked_internal.h#L64) class. This seems to build on top of `ChunkResolver` and provides the added benefit of returning the actual chunk `Array` instead of just its location. I feel like this is convenient for many use-cases and was wondering if we should try to fold `ChunkedArrayResolver`'s behavior into `ChunkResolver` somehow. Do you have any opinions or concerns about this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1972361985

   > So just to clarify, is the plan to move both ChunkResolver and ChunkedArrayResolver into the public namespace?
   
   Not my plan, but after you mentioned you wanted to unify `ChunkResolver` with `ChunkedArrayResolver` I took a look on how it's designed.
   
   If it were to go public, that would be done in another issue/PR pair, so for now it's better to focus just on making `ChunkResolver` public.
   
   About the code:
   
   ```cpp
   for (int i = 0; i < table->num_rows(); ++i)
   {
     std::int64_t a = resolver_a.Resolve(i).Value<arrow::Int64Type>();
     std::string_view b = resolver_b.Resolve(i).Value<arrow::StringType>();
     do_business_logic(a, b);
   }
   ```
   
   It's important to delay the re-construction of row-by-row values as much as possible to preserve the benefits of columnar layouts. So there is a danger in exposing too many APIs that work value-by-value instead or array-by-array. Array-by-array is not compatible with how most people think about programming.
   
   The loop above is performing sequential access, so using the chunk resolver is not the best solution regarding locality of the memory accesses. [1] It's better to keep one `ChunkLocation` per column and increment them all at once. 
   
   
   [1] `ChunkResolver` and `ChunkedArrayResolver` are used internally in Arrow mainly on functions that perform random access: `Take`, `Sort`, `Rank`...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] Request to move ChunkResolver to public API [arrow]

Posted by "SChakravorti21 (via GitHub)" <gi...@apache.org>.
SChakravorti21 commented on issue #34535:
URL: https://github.com/apache/arrow/issues/34535#issuecomment-1977490469

   > There is a case for making `ChunkResolver` public already. Please create a separate issue to talk about `ChunkedArrayResolver`.
   
   For sure, I can make a separate issue for `ChunkedArrayResolver`. My bad for mixing these two discussions.
   
   > But my main problem with your code (and let me be more direct this time with what I mean by "random access") is that you're using `Resolve`/`Value` (which is `O(log(num_chunks)))` on every iteration when you could be incrementing each `ChunkLocation` in O(1) without having to rely on the caching in `Resolve` to make it "O(1) most of the time" + overhead.
   
   That makes sense, I didn't fully understand what you meant previously.
   
   I think the API additions you're suggesting make sense, but I'm confused how someone would use them to iterate over multiple columns simultaneously. Is there such a thing as a "typed `ChunkedArray`"? Otherwise how would we expect someone to access the values inside this loop:
   
   ```cpp
   ChunkResolver resolver(batches);
   
   for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
     // what is the most efficient way to access the values for each column here?
   }
   ```
   
   The benefit of just iterating over the batches themselves is that we only perform the cast from untyped `Array` to typed array (`Int64Array`, `StringArray`, etc.) once per column per batch. This is cheaper if someone prefers to use `dynamic_cast` for safety vs. doing a cast once per datapoint.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org