Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/26 21:13:51 UTC

[GitHub] [arrow] westonpace edited a comment on pull request #10794: ARROW-13441: [C++][CSV] Skip empty batches in column decoder

westonpace edited a comment on pull request #10794:
URL: https://github.com/apache/arrow/pull/10794#issuecomment-887029318


   Yes, I removed `SerialStreamingReader::ReadNextSkippingEmpty` semi-intentionally because I thought there could be only one skipped batch.  I hadn't realized that the right combination of skip settings can yield multiple empty parsed blocks, so that logic doesn't work.  Filtering in parallel is tricky, and rather than solve that I think we can get away with filtering at one of the serial spots.
   
   Here is a commit that adds a filtering utility to the async generators: https://github.com/westonpace/arrow/commit/8d80722a8b8cd304dff8322e96d76dfd75899ea7
   
   I think the cleanest approach might be to filter empty parsed blocks when looking for the first batch.  This should be safe, as long as you pass `rb_gen` (not `filtered_rb_gen`) for use after you have the first batch:
   ```
       auto filtered_rb_gen = MakeFilteredGenerator(rb_gen, std::move(not_empty));
       return filtered_rb_gen().Then([self, rb_gen, max_readahead](const DecodedBlock& first_block) {
         return self->InitAfterFirstBatch(first_block, std::move(rb_gen), max_readahead);
       });
   ```
   Or we could filter empty batches on the other side of the readahead generator, but then you'd have to figure out the schema problem.
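
   To make the first option concrete, here is a minimal synchronous sketch of the idea.  It is not Arrow's API: Arrow's generators return futures, while `Block`, `BlockGenerator`, `MakeVectorGenerator` and `MakeFilteredGeneratorSketch` below are hypothetical stand-ins that only assume copies of a generator share the same underlying position.  The filter is used just to find the first non-empty block, and the unfiltered generator then resumes right after it:
   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <functional>
   #include <memory>
   #include <optional>
   #include <utility>
   #include <vector>

   // Hypothetical stand-in for a decoded block; only the row count matters here.
   struct Block {
     int index;
     std::size_t num_rows;
   };

   // Pull-style generator: each call yields the next block, or nullopt at the end.
   // (Assumption: synchronous for simplicity; the real generators are async.)
   using BlockGenerator = std::function<std::optional<Block>()>;

   // Generator over a fixed list. Copies share one cursor, so pulling through
   // one copy advances every other copy as well.
   BlockGenerator MakeVectorGenerator(std::vector<Block> blocks) {
     struct State {
       std::vector<Block> blocks;
       std::size_t pos = 0;
     };
     auto state = std::make_shared<State>();
     state->blocks = std::move(blocks);
     return [state]() -> std::optional<Block> {
       if (state->pos >= state->blocks.size()) return std::nullopt;
       return state->blocks[state->pos++];
     };
   }

   // Wraps `source` and skips every block that `keep` rejects.
   BlockGenerator MakeFilteredGeneratorSketch(BlockGenerator source,
                                              std::function<bool(const Block&)> keep) {
     return [source = std::move(source), keep = std::move(keep)]() -> std::optional<Block> {
       while (auto next = source()) {
         if (keep(*next)) return next;
       }
       return std::nullopt;
     };
   }

   // Finds the first non-empty block via the filter, then continues on the
   // unfiltered generator, which still sees later empty blocks.
   void DemoFirstNonEmptyThenUnfiltered() {
     BlockGenerator rb_gen = MakeVectorGenerator({{0, 0}, {1, 0}, {2, 5}, {3, 0}});
     auto not_empty = [](const Block& b) { return b.num_rows > 0; };
     BlockGenerator filtered_rb_gen = MakeFilteredGeneratorSketch(rb_gen, not_empty);

     std::optional<Block> first_block = filtered_rb_gen();
     assert(first_block.has_value() && first_block->index == 2);

     std::optional<Block> next_block = rb_gen();  // unfiltered: block 3 is empty
     assert(next_block.has_value() && next_block->index == 3);
   }
   ```
   Note that in this sketch empty blocks after the first batch still flow through `rb_gen`, so whatever consumes it would need to tolerate or skip them.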
   
   

