Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/26 22:15:51 UTC

[GitHub] [arrow] romgrk-comparative commented on issue #10803: Reading strings efficiently in C++

romgrk-comparative commented on issue #10803:
URL: https://github.com/apache/arrow/issues/10803#issuecomment-887064126


   > I'm a little confused by this point. In the code below you are creating two pointers, index should be a pointer to the indices and view should be a pointer to the data. 
   
   You're right, I haven't described everything.
   
   When I inspect the actual data, each string is stored once per occurrence rather than once per distinct value. The offsets in `index` never point to the same bytes, even when two rows hold the same value; each row points to its own copy of the string.
   
   ```c++
   const int64_t length = array->length;
   const int32_t *index = array->GetValues<int32_t>(1, 0);
   const char    *view  = array->GetValues<char>(2, 0);
   
   for (int64_t i = 0; i < length; ++i) {
       auto valueStart = index[i];
       auto valueEnd   = index[i + 1]; // <-- Because a value's end offset is the
                                       //     next value's start offset, it's also
                                       //     impossible to point to the same memory
                                       //     region for multiple rows :[
   
       auto valueData = view + valueStart;
       auto valueLength = valueEnd - valueStart;
   
       std::string value(valueData, valueLength);
   
       printf("%s: %i \n", value.c_str(), valueStart);
   }
   ```
   
   Example output:
   ```bash
   a: 0
   b: 1
   c: 2
   a: 3         # Here, I'd want the offset to point to the same memory as the first line
   ...
   ```
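   
   To make the two layouts concrete, here is a rough standalone sketch in plain C++ (no Arrow calls). The names `PlainStrings`, `DictStrings`, `BuildPlain`, `PlainValue` and `BuildDict` are hypothetical and only illustrate the idea; they are not Arrow APIs. The first layout mirrors what I see above (offsets plus one contiguous data buffer); the second is what I expected (each distinct value stored once, with one index per row).
   
   ```c++
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <unordered_map>
   #include <vector>
   
   // Hypothetical illustration types, not Arrow classes.
   // Plain layout: length + 1 offsets into one contiguous data buffer.
   struct PlainStrings {
       std::vector<int32_t> offsets;  // non-decreasing; offsets[i + 1] ends row i
       std::string data;              // all row values concatenated
   };
   
   // Dictionary layout: distinct values stored once, one index per row.
   struct DictStrings {
       std::vector<std::string> values;
       std::vector<int32_t> indices;
   };
   
   PlainStrings BuildPlain(const std::vector<std::string>& rows) {
       PlainStrings out;
       out.offsets.push_back(0);
       for (const auto& row : rows) {
           out.data += row;  // repeated values are appended again
           out.offsets.push_back(static_cast<int32_t>(out.data.size()));
       }
       return out;
   }
   
   // Same arithmetic as the loop above: start = offsets[i], end = offsets[i + 1].
   std::string PlainValue(const PlainStrings& s, std::size_t i) {
       assert(i + 1 < s.offsets.size());
       const int32_t start = s.offsets[i];
       const int32_t end = s.offsets[i + 1];
       return s.data.substr(static_cast<std::size_t>(start),
                            static_cast<std::size_t>(end - start));
   }
   
   DictStrings BuildDict(const std::vector<std::string>& rows) {
       DictStrings out;
       std::unordered_map<std::string, int32_t> seen;  // value -> dictionary index
       for (const auto& row : rows) {
           auto it = seen.find(row);
           if (it == seen.end()) {
               it = seen.emplace(row, static_cast<int32_t>(out.values.size())).first;
               out.values.push_back(row);
           }
           out.indices.push_back(it->second);  // repeated values reuse the index
       }
       return out;
   }
   ```
   
   With the rows `a, b, c, a`, the plain layout ends up with four separate start offsets, matching the output above, while the dictionary layout would give the fourth row the same index as the first.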
   
   Is there something I should be doing differently here to retrieve the dictionary indices instead of the raw data?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org