You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/06 13:40:08 UTC

[GitHub] [arrow] n3world opened a new pull request #10662: ARROW-13252: [C++] Add offset to error reporting

n3world opened a new pull request #10662:
URL: https://github.com/apache/arrow/pull/10662


   Add the offset of the start of a row to parsing and decoding errors.
   
   In order to know the row offset during decoding a new array of offsets was added to DataBatch which indicate the offset of the row relative to the offset of the batch. This adds an overhead of an additional vector and an array of ints with one int per row


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-909720129


   @pitrou or @westonpace any opinions on this MR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942316179


   > Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.
   
   I agree that the performance change seems severe. I am not sure why adding ints to a second buffer would affect the performance that much. I also was getting wildly different values between runs on the same branch sometimes almost a 50% change so I am not sure how reliable any of those numbers really are. I can play with it some to see if I can figure out what is going on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367


   > Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
   
   I was getting some wide variations between runs but these are the best numbers I got for master and this branch
   master:
   
   ```
    ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         168985 ns       168981 ns         4117 bytes_per_second=959.423M/s
   ChunkCSVEscapedBlock        155557 ns       155555 ns         4460 bytes_per_second=980.924M/s
   ChunkCSVNoNewlinesBlock        147 ns          147 ns      4741333 bytes_per_second=0/s
   ParseCSVQuotedBlock         263473 ns       263469 ns         2646 bytes_per_second=615.346M/s
   ParseCSVEscapedBlock        209135 ns       209132 ns         3362 bytes_per_second=729.624M/s
   ParseCSVFlightsExample     2175256 ns      2175243 ns          320 bytes_per_second=446.642M/s
   ParseCSVVehiclesExample   15967256 ns     15967111 ns           44 bytes_per_second=718.222M/s
   ParseCSVStocksExample      3463566 ns      3463298 ns          203 bytes_per_second=605.926M/s
   ```
   
   This branch:
   ```
    ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         169009 ns       169006 ns         4093 bytes_per_second=959.283M/s
   ChunkCSVEscapedBlock        156445 ns       156443 ns         4467 bytes_per_second=975.356M/s
   ChunkCSVNoNewlinesBlock        149 ns          149 ns      4749759 bytes_per_second=0/s
   ParseCSVQuotedBlock         369561 ns       369551 ns         1882 bytes_per_second=438.707M/s
   ParseCSVEscapedBlock        367681 ns       367671 ns         1867 bytes_per_second=415.012M/s
   ParseCSVFlightsExample     2538161 ns      2538102 ns          278 bytes_per_second=382.788M/s
   ParseCSVVehiclesExample   16641194 ns     16639585 ns           42 bytes_per_second=689.196M/s
   ParseCSVStocksExample      3119450 ns      3119364 ns          224 bytes_per_second=672.734M/s
   ```
   
   No significant difference that I can see


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-1033833088


   It seems like this functionality is not desired. Should I just close the PR and associated jira ticket?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965495803


   While the micro-benchmarks don't show any regression, it's quite conceivable that a more memory-heavy workload would show a regression. After all, we're building up an additional buffer, even though it's smaller than the values buffer. So I'm a bit lukewarm on this, especially as the motivation is a rather niche use case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942799576


   I think I figured out the issue. It was the overhead of checking the buffer bounds on every append. I changed it back to not checking bounds for the fixed size buffer and numbers look much better
   
   ```
   ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         174105 ns       174102 ns         4012 bytes_per_second=931.204M/s
   ChunkCSVEscapedBlock        143420 ns       143417 ns         4863 bytes_per_second=1063.95M/s
   ChunkCSVNoNewlinesBlock        179 ns          179 ns      4276522 bytes_per_second=0/s
   ParseCSVQuotedBlock         286099 ns       286084 ns         2434 bytes_per_second=566.703M/s
   ParseCSVEscapedBlock        267409 ns       267403 ns         2612 bytes_per_second=570.629M/s
   ParseCSVFlightsExample     2238658 ns      2238612 ns          310 bytes_per_second=433.999M/s
   ParseCSVVehiclesExample   16297590 ns     16297154 ns           43 bytes_per_second=703.677M/s
   ParseCSVStocksExample      3236563 ns      3236484 ns          216 bytes_per_second=648.39M/s
   ```
   
   Those numbers are about what I am getting now for master too


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367


   > Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
   
   I was getting some wide variations between runs but these are the best numbers I got for master and this branch
   master:
   
   ```
    ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         168985 ns       168981 ns         4117 bytes_per_second=959.423M/s
   ChunkCSVEscapedBlock        155557 ns       155555 ns         4460 bytes_per_second=980.924M/s
   ChunkCSVNoNewlinesBlock        147 ns          147 ns      4741333 bytes_per_second=0/s
   ParseCSVQuotedBlock         263473 ns       263469 ns         2646 bytes_per_second=615.346M/s
   ParseCSVEscapedBlock        209135 ns       209132 ns         3362 bytes_per_second=729.624M/s
   ParseCSVFlightsExample     2175256 ns      2175243 ns          320 bytes_per_second=446.642M/s
   ParseCSVVehiclesExample   15967256 ns     15967111 ns           44 bytes_per_second=718.222M/s
   ParseCSVStocksExample      3463566 ns      3463298 ns          203 bytes_per_second=605.926M/s
   ```
   
   This branch:
   ```
    ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         169009 ns       169006 ns         4093 bytes_per_second=959.283M/s
   ChunkCSVEscapedBlock        156445 ns       156443 ns         4467 bytes_per_second=975.356M/s
   ChunkCSVNoNewlinesBlock        149 ns          149 ns      4749759 bytes_per_second=0/s
   ParseCSVQuotedBlock         369561 ns       369551 ns         1882 bytes_per_second=438.707M/s
   ParseCSVEscapedBlock        367681 ns       367671 ns         1867 bytes_per_second=415.012M/s
   ParseCSVFlightsExample     2538161 ns      2538102 ns          278 bytes_per_second=382.788M/s
   ParseCSVVehiclesExample   16641194 ns     16639585 ns           42 bytes_per_second=689.196M/s
   ParseCSVStocksExample      3119450 ns      3119364 ns          224 bytes_per_second=672.734M/s
   ```
   
   Most have no significant difference but ParseCSVQuotedBlock, ParseCSVEscapedBlock and ParseCSVFlightsExample did show performance degradation


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707644015



##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
     return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
   }
 
+  /**
+   * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+   * @return the parser offset which should be used by the caller
+   */
+  int64_t UpdateParserOffset(const CSVBlock& block) {
+    parser_offset_ += block.bytes_skipped;
+    auto result = parser_offset_;

Review comment:
       It's been a hard habit for me to break, as I used to do the same thing, but `result` is not a good variable name.  Maybe `updated_offset` or something like that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878


   Any opinion on this @pitrou or @westonpace ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-909720129


   @pitrou or @westonpace any opinions on this MR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] github-actions[bot] commented on pull request #10662: ARROW-13252: [C++] Add offset to error reporting

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-874770122


   https://issues.apache.org/jira/browse/ARROW-13252


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

westonpace commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707619719



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {

Review comment:
       This should return `Status`.  OOM is a real consideration and it should be, ideally, handled gracefully.

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {
+    if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+      capacity_ *= 2;
+      ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));

Review comment:
       `RETURN_NOT_OK`

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -391,17 +391,20 @@ namespace {
 class BlockParsingOperator {
  public:
   BlockParsingOperator(io::IOContext io_context, ParseOptions parse_options,
-                       int num_csv_cols, int64_t first_row)
+                       int num_csv_cols, int64_t first_row, int64_t offset)
       : io_context_(io_context),
         parse_options_(parse_options),
         num_csv_cols_(num_csv_cols),
         count_rows_(first_row >= 0),
-        num_rows_seen_(first_row) {}
+        num_rows_seen_(first_row),
+        next_block_offset_(offset) {}
 
   Result<ParsedBlock> operator()(const CSVBlock& block) {
     constexpr int32_t max_num_rows = std::numeric_limits<int32_t>::max();
-    auto parser = std::make_shared<BlockParser>(
-        io_context_.pool(), parse_options_, num_csv_cols_, num_rows_seen_, max_num_rows);
+    next_block_offset_ += block.bytes_skipped;

Review comment:
       At the moment streaming parsing is always serial but that is due to a bug/limitation (ARROW-13155).  Eventually, `operator()` will be called multiple times at once (across multiple blocks).  I think this breaks both `num_rows_seen_` and `next_block_offset_` but we can fix it then.

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {

Review comment:
       Ah, I see now this is just how it was before.  I don't think you need to fix it in this PR then.  @pitrou do you know if this was done for performance reasons?

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {

Review comment:
       I'm not sure I like the name `TypedXyz` makes me think of "types" which, in Arrow, usually means `DataType`.  So I would expect something called a `TypedBuffer` to be similar to an array.  Also, `Buffer` I would think of as something that has a fixed size.  Maybe `ResettableBufferBuilder` or `ResettableBuilder`?  Then the name is pretty similar to the builders we have but the functionality is pretty similar too so I think that fits.

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
     return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
   }
 
+  /**
+   * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+   * @return the parser offset which should be used by the caller
+   */
+  int64_t UpdateParserOffset(const CSVBlock& block) {
+    parser_offset_ += block.bytes_skipped;
+    auto result = parser_offset_;

Review comment:
       It's been a hard habit for me to break but `result` is not a good variable name.  Maybe `updated_offset` or something like that?

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {
+    if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+      capacity_ *= 2;
+      ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
+      typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+    }
+    typed_buffer_[size_++] = value;
+  }
+
+  Status Shrink() {
+    ARROW_CHECK_OK(buffer_->Resize(size_ * sizeof(Type)));
+    return arrow::Status::OK();

Review comment:
       I'm quite certain that `PoolBuffer` will reallocate if a resize shrinks the capacity enough to fit into a smaller power of 2.  As such, you should probably be reassigning `typed_buffer_` here.  Although,  in usage, it appears you only call `Shrink` when you're finished so maybe just name it `Finish` and have it return the `buffer` and get rid of `buffer()`?

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {
+    if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+      capacity_ *= 2;
+      ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
+      typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+    }
+    typed_buffer_[size_++] = value;
+  }
+
+  Status Shrink() {
+    ARROW_CHECK_OK(buffer_->Resize(size_ * sizeof(Type)));

Review comment:
       `RETURN_NOT_OK`.  I could maybe be convinced this would never fail since you should be going to a smaller size but since you're returning `Status` anyways it's probably better to be consistent and who knows how allocators choose to work.

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {

Review comment:
       This should be `void Append(Type value)` to be consistent with other usage (ParsedValueDesc is passed by value)

##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
     return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
   }
 
+  /**
+   * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+   * @return the parser offset which should be used by the caller

Review comment:
       Use `\brief` and `\return` and `///` comments




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-934508373


   > Most have no significant difference but ParseCSVQuotedBlock, ParseCSVEscapedBlock and ParseCSVFlightsExample did show performance degradation
   
   Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world closed pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world closed pull request #10662:
URL: https://github.com/apache/arrow/pull/10662


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world closed pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world closed pull request #10662:
URL: https://github.com/apache/arrow/pull/10662


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965546170


   > While the micro-benchmarks don't show any regression, it's quite conceivable that a more memory-heavy workload would show a regression.
   
   That could be an issue if the system is memory constrained relative to the size of the csv
   
   > So I'm a bit lukewarm on this, especially as the motivation is a rather niche use case.
   
   I'm not sure how returning the offset of an error is a niche use case especially when using threaded parsing the line number is not known so when an error occurs there is no line number to indicate where the error was encountered
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707847507



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
   int64_t saved_parsed_size_;
 };
 
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+  explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+      : size_(0), capacity_(capacity), mark_(-1) {
+    buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+    typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+  }
+
+  void Append(const Type& value) {
+    if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+      capacity_ *= 2;
+      ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));

Review comment:
       As noted before this is a void method, so I cannot return the status here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965595707


   > That could be an issue if the system is memory constrained relative to the size of the csv
   
   More generally though, more allocations and more memory writing can increase the impact on the CPU cache and memory subsystem, even if lots of RAM are still available.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942799576


   I think I figured out the issue. It was the overhead of checking the buffer bounds on every append. I changed it back to not checking bounds for the fixed size buffer and numbers look much better
   
   ```
   ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         174105 ns       174102 ns         4012 bytes_per_second=931.204M/s
   ChunkCSVEscapedBlock        143420 ns       143417 ns         4863 bytes_per_second=1063.95M/s
   ChunkCSVNoNewlinesBlock        179 ns          179 ns      4276522 bytes_per_second=0/s
   ParseCSVQuotedBlock         286099 ns       286084 ns         2434 bytes_per_second=566.703M/s
   ParseCSVEscapedBlock        267409 ns       267403 ns         2612 bytes_per_second=570.629M/s
   ParseCSVFlightsExample     2238658 ns      2238612 ns          310 bytes_per_second=433.999M/s
   ParseCSVVehiclesExample   16297590 ns     16297154 ns           43 bytes_per_second=703.677M/s
   ParseCSVStocksExample      3236563 ns      3236484 ns          216 bytes_per_second=648.39M/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878


   Any opinion on this @pitrou or @westonpace ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367


   > Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
   
   I was getting some wide variations between runs but these are the best numbers I got for master and this branch
   master:
   
   ` ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         168985 ns       168981 ns         4117 bytes_per_second=959.423M/s
   ChunkCSVEscapedBlock        155557 ns       155555 ns         4460 bytes_per_second=980.924M/s
   ChunkCSVNoNewlinesBlock        147 ns          147 ns      4741333 bytes_per_second=0/s
   ParseCSVQuotedBlock         263473 ns       263469 ns         2646 bytes_per_second=615.346M/s
   ParseCSVEscapedBlock        209135 ns       209132 ns         3362 bytes_per_second=729.624M/s
   ParseCSVFlightsExample     2175256 ns      2175243 ns          320 bytes_per_second=446.642M/s
   ParseCSVVehiclesExample   15967256 ns     15967111 ns           44 bytes_per_second=718.222M/s
   ParseCSVStocksExample      3463566 ns      3463298 ns          203 bytes_per_second=605.926M/s`
   
   This branch:
   ` ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         169009 ns       169006 ns         4093 bytes_per_second=959.283M/s
   ChunkCSVEscapedBlock        156445 ns       156443 ns         4467 bytes_per_second=975.356M/s
   ChunkCSVNoNewlinesBlock        149 ns          149 ns      4749759 bytes_per_second=0/s
   ParseCSVQuotedBlock         369561 ns       369551 ns         1882 bytes_per_second=438.707M/s
   ParseCSVEscapedBlock        367681 ns       367671 ns         1867 bytes_per_second=415.012M/s
   ParseCSVFlightsExample     2538161 ns      2538102 ns          278 bytes_per_second=382.788M/s
   ParseCSVVehiclesExample   16641194 ns     16639585 ns           42 bytes_per_second=689.196M/s
   ParseCSVStocksExample      3119450 ns      3119364 ns          224 bytes_per_second=672.734M/s`
   
   No significant difference that I can see


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367


   > Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
   
   I was getting some wide variations between runs but these are the best numbers I got for master and this branch
   master:
   
   > ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         168985 ns       168981 ns         4117 bytes_per_second=959.423M/s
   ChunkCSVEscapedBlock        155557 ns       155555 ns         4460 bytes_per_second=980.924M/s
   ChunkCSVNoNewlinesBlock        147 ns          147 ns      4741333 bytes_per_second=0/s
   ParseCSVQuotedBlock         263473 ns       263469 ns         2646 bytes_per_second=615.346M/s
   ParseCSVEscapedBlock        209135 ns       209132 ns         3362 bytes_per_second=729.624M/s
   ParseCSVFlightsExample     2175256 ns      2175243 ns          320 bytes_per_second=446.642M/s
   ParseCSVVehiclesExample   15967256 ns     15967111 ns           44 bytes_per_second=718.222M/s
   ParseCSVStocksExample      3463566 ns      3463298 ns          203 bytes_per_second=605.926M/s
   
   This branch:
   > ----------------------------------------------------------------------------------
   Benchmark                        Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------
   ChunkCSVQuotedBlock         169009 ns       169006 ns         4093 bytes_per_second=959.283M/s
   ChunkCSVEscapedBlock        156445 ns       156443 ns         4467 bytes_per_second=975.356M/s
   ChunkCSVNoNewlinesBlock        149 ns          149 ns      4749759 bytes_per_second=0/s
   ParseCSVQuotedBlock         369561 ns       369551 ns         1882 bytes_per_second=438.707M/s
   ParseCSVEscapedBlock        367681 ns       367671 ns         1867 bytes_per_second=415.012M/s
   ParseCSVFlightsExample     2538161 ns      2538102 ns          278 bytes_per_second=382.788M/s
   ParseCSVVehiclesExample   16641194 ns     16639585 ns           42 bytes_per_second=689.196M/s
   ParseCSVStocksExample      3119450 ns      3119364 ns          224 bytes_per_second=672.734M/s
   
   No significant difference that I can see


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-1033983242


   @n3world Sorry. Judging by the amount of complexity and runtime overhead it adds, I think it's not desirable indeed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting

Posted by GitBox <gi...@apache.org>.

n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942316179


   > Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.
   
   I agree that the performance change seems severe. I am not sure why adding ints to a second buffer would affect the performance that much. I also was getting wildly different values between runs on the same branch sometimes almost a 50% change so I am not sure how reliable any of those numbers really are. I can play with it some to see if I can figure out what is going on.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org