You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/06 13:40:08 UTC
[GitHub] [arrow] n3world opened a new pull request #10662: ARROW-13252: [C++] Add offset to error reporting
n3world opened a new pull request #10662:
URL: https://github.com/apache/arrow/pull/10662
Add the offset of the start of a row to parsing and decoding errors.
In order to know the row offset during decoding a new array of offsets was added to DataBatch which indicate the offset of the row relative to the offset of the batch. This adds an overhead of an additional vector and an array of ints with one int per row
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-909720129
@pitrou or @westonpace any opinions on this MR?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942316179
> Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.
I agree that the performance change seems severe. I am not sure why adding ints to a second buffer would affect the performance that much. I also was getting wildly different values between runs on the same branch sometimes almost a 50% change so I am not sure how reliable any of those numbers really are. I can play with it some to see if I can figure out what is going on.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367
> Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
I was getting some wide variations between runs but these are the best numbers I got for master and this branch
master:
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 168985 ns 168981 ns 4117 bytes_per_second=959.423M/s
ChunkCSVEscapedBlock 155557 ns 155555 ns 4460 bytes_per_second=980.924M/s
ChunkCSVNoNewlinesBlock 147 ns 147 ns 4741333 bytes_per_second=0/s
ParseCSVQuotedBlock 263473 ns 263469 ns 2646 bytes_per_second=615.346M/s
ParseCSVEscapedBlock 209135 ns 209132 ns 3362 bytes_per_second=729.624M/s
ParseCSVFlightsExample 2175256 ns 2175243 ns 320 bytes_per_second=446.642M/s
ParseCSVVehiclesExample 15967256 ns 15967111 ns 44 bytes_per_second=718.222M/s
ParseCSVStocksExample 3463566 ns 3463298 ns 203 bytes_per_second=605.926M/s
```
This branch:
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 169009 ns 169006 ns 4093 bytes_per_second=959.283M/s
ChunkCSVEscapedBlock 156445 ns 156443 ns 4467 bytes_per_second=975.356M/s
ChunkCSVNoNewlinesBlock 149 ns 149 ns 4749759 bytes_per_second=0/s
ParseCSVQuotedBlock 369561 ns 369551 ns 1882 bytes_per_second=438.707M/s
ParseCSVEscapedBlock 367681 ns 367671 ns 1867 bytes_per_second=415.012M/s
ParseCSVFlightsExample 2538161 ns 2538102 ns 278 bytes_per_second=382.788M/s
ParseCSVVehiclesExample 16641194 ns 16639585 ns 42 bytes_per_second=689.196M/s
ParseCSVStocksExample 3119450 ns 3119364 ns 224 bytes_per_second=672.734M/s
```
No significant difference that I can see
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-1033833088
It seems like this functionality is not desired. Should I just close the PR and associated jira ticket?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965495803
While the micro-benchmarks don't show any regression, it's quite conceivable that a more memory-heavy workload would show a regression. After all, we're building up an additional buffer, even though it's smaller than the values buffer. So I'm a bit lukewarm on this, especially as the motivation is a rather niche use case.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942799576
I think I figured out the issue. It was the overhead of checking the buffer bounds on every append. I changed it back to not checking bounds for the fixed size buffer and numbers look much better
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 174105 ns 174102 ns 4012 bytes_per_second=931.204M/s
ChunkCSVEscapedBlock 143420 ns 143417 ns 4863 bytes_per_second=1063.95M/s
ChunkCSVNoNewlinesBlock 179 ns 179 ns 4276522 bytes_per_second=0/s
ParseCSVQuotedBlock 286099 ns 286084 ns 2434 bytes_per_second=566.703M/s
ParseCSVEscapedBlock 267409 ns 267403 ns 2612 bytes_per_second=570.629M/s
ParseCSVFlightsExample 2238658 ns 2238612 ns 310 bytes_per_second=433.999M/s
ParseCSVVehiclesExample 16297590 ns 16297154 ns 43 bytes_per_second=703.677M/s
ParseCSVStocksExample 3236563 ns 3236484 ns 216 bytes_per_second=648.39M/s
```
Those numbers are about what I am getting now for master too
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367
> Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
I was getting some wide variations between runs but these are the best numbers I got for master and this branch
master:
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 168985 ns 168981 ns 4117 bytes_per_second=959.423M/s
ChunkCSVEscapedBlock 155557 ns 155555 ns 4460 bytes_per_second=980.924M/s
ChunkCSVNoNewlinesBlock 147 ns 147 ns 4741333 bytes_per_second=0/s
ParseCSVQuotedBlock 263473 ns 263469 ns 2646 bytes_per_second=615.346M/s
ParseCSVEscapedBlock 209135 ns 209132 ns 3362 bytes_per_second=729.624M/s
ParseCSVFlightsExample 2175256 ns 2175243 ns 320 bytes_per_second=446.642M/s
ParseCSVVehiclesExample 15967256 ns 15967111 ns 44 bytes_per_second=718.222M/s
ParseCSVStocksExample 3463566 ns 3463298 ns 203 bytes_per_second=605.926M/s
```
This branch:
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 169009 ns 169006 ns 4093 bytes_per_second=959.283M/s
ChunkCSVEscapedBlock 156445 ns 156443 ns 4467 bytes_per_second=975.356M/s
ChunkCSVNoNewlinesBlock 149 ns 149 ns 4749759 bytes_per_second=0/s
ParseCSVQuotedBlock 369561 ns 369551 ns 1882 bytes_per_second=438.707M/s
ParseCSVEscapedBlock 367681 ns 367671 ns 1867 bytes_per_second=415.012M/s
ParseCSVFlightsExample 2538161 ns 2538102 ns 278 bytes_per_second=382.788M/s
ParseCSVVehiclesExample 16641194 ns 16639585 ns 42 bytes_per_second=689.196M/s
ParseCSVStocksExample 3119450 ns 3119364 ns 224 bytes_per_second=672.734M/s
```
Most have no significant difference but ParseCSVQuotedBlock, ParseCSVEscapedBlock and ParseCSVFlightsExample did show performance degradation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707644015
##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
}
+ /**
+ * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+ * @return the parser offset which should be used by the caller
+ */
+ int64_t UpdateParserOffset(const CSVBlock& block) {
+ parser_offset_ += block.bytes_skipped;
+ auto result = parser_offset_;
Review comment:
It's been a hard habit for me to break, as I used to do the same thing, but `result` is not a good variable name. Maybe `updated_offset` or something like that?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878
Any opinion on this @pitrou or @westonpace ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-909720129
@pitrou or @westonpace any opinions on this MR?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10662: ARROW-13252: [C++] Add offset to error reporting
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-874770122
https://issues.apache.org/jira/browse/ARROW-13252
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707619719
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
Review comment:
This should return `Status`. OOM is a real consideration and it should be, ideally, handled gracefully.
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
+ if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+ capacity_ *= 2;
+ ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
Review comment:
`RETURN_NOT_OK`
##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -391,17 +391,20 @@ namespace {
class BlockParsingOperator {
public:
BlockParsingOperator(io::IOContext io_context, ParseOptions parse_options,
- int num_csv_cols, int64_t first_row)
+ int num_csv_cols, int64_t first_row, int64_t offset)
: io_context_(io_context),
parse_options_(parse_options),
num_csv_cols_(num_csv_cols),
count_rows_(first_row >= 0),
- num_rows_seen_(first_row) {}
+ num_rows_seen_(first_row),
+ next_block_offset_(offset) {}
Result<ParsedBlock> operator()(const CSVBlock& block) {
constexpr int32_t max_num_rows = std::numeric_limits<int32_t>::max();
- auto parser = std::make_shared<BlockParser>(
- io_context_.pool(), parse_options_, num_csv_cols_, num_rows_seen_, max_num_rows);
+ next_block_offset_ += block.bytes_skipped;
Review comment:
At the moment streaming parsing is always serial but that is due to a bug/limitation (ARROW-13155). Eventually, `operator()` will be called multiple times at once (across multiple blocks). I think this breaks both `num_rows_seen_` and `next_block_offset_` but we can fix it then.
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
Review comment:
Ah, I see now this is just how it was before. I don't think you need to fix it in this PR then. @pitrou do you know if this was done for performance reasons?
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
Review comment:
I'm not sure I like the name `TypedXyz` makes me think of "types" which, in Arrow, usually means `DataType`. So I would expect something called a `TypedBuffer` to be similar to an array. Also, `Buffer` I would think of as something that has a fixed size. Maybe `ResettableBufferBuilder` or `ResettableBuilder`? Then the name is pretty similar to the builders we have but the functionality is pretty similar too so I think that fits.
##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
}
+ /**
+ * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+ * @return the parser offset which should be used by the caller
+ */
+ int64_t UpdateParserOffset(const CSVBlock& block) {
+ parser_offset_ += block.bytes_skipped;
+ auto result = parser_offset_;
Review comment:
It's been a hard habit for me to break but `result` is not a good variable name. Maybe `updated_offset` or something like that?
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
+ if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+ capacity_ *= 2;
+ ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+ typed_buffer_[size_++] = value;
+ }
+
+ Status Shrink() {
+ ARROW_CHECK_OK(buffer_->Resize(size_ * sizeof(Type)));
+ return arrow::Status::OK();
Review comment:
I'm quite certain that `PoolBuffer` will reallocate if a resize shrinks the capacity enough to fit into a smaller power of 2. As such, you should probably be reassigning `typed_buffer_` here. Although, in usage, it appears you only call `Shrink` when you're finished so maybe just name it `Finish` and have it return the `buffer` and get rid of `buffer()`?
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
+ if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+ capacity_ *= 2;
+ ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+ typed_buffer_[size_++] = value;
+ }
+
+ Status Shrink() {
+ ARROW_CHECK_OK(buffer_->Resize(size_ * sizeof(Type)));
Review comment:
`RETURN_NOT_OK`. I could maybe be convinced this would never fail since you should be going to a smaller size but since you're returning `Status` anyways it's probably better to be consistent and who knows how allocators choose to work.
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
Review comment:
This should be `void Append(Type value)` to be consistent with other usage (ParsedValueDesc is passed by value)
##########
File path: cpp/src/arrow/csv/reader.cc
##########
@@ -742,6 +751,25 @@ class ReaderMixin {
return ParseResult{std::move(parser), static_cast<int64_t>(parsed_size)};
}
+ /**
+ * @brief add the `size()` of all buffers in `block` to `parser_offset_`
+ * @return the parser offset which should be used by the caller
Review comment:
Use `\brief` and `\return` and `///` comments
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-934508373
> Most have no significant difference but ParseCSVQuotedBlock, ParseCSVEscapedBlock and ParseCSVFlightsExample did show performance degradation
Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world closed pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world closed pull request #10662:
URL: https://github.com/apache/arrow/pull/10662
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world closed pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world closed pull request #10662:
URL: https://github.com/apache/arrow/pull/10662
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965546170
> While the micro-benchmarks don't show any regression, it's quite conceivable that a more memory-heavy workload would show a regression.
That could be an issue if the system is memory constrained relative to the size of the csv
> So I'm a bit lukewarm on this, especially as the motivation is a rather niche use case.
I'm not sure how returning the offset of an error is a niche use case especially when using threaded parsing the line number is not known so when an error occurs there is no line number to indicate where the error was encountered
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on a change in pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#discussion_r707847507
##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -100,6 +100,47 @@ class PresizedDataWriter {
int64_t saved_parsed_size_;
};
+/// \brief Utility class for writing one type into a buffer
+/// \tparam Type of the value stored in this buffer
+template <typename Type>
+class TypedBuffer {
+ public:
+ explicit TypedBuffer(MemoryPool* pool, int64_t capacity)
+ : size_(0), capacity_(capacity), mark_(-1) {
+ buffer_ = *AllocateResizableBuffer(capacity_ * sizeof(Type), pool);
+ typed_buffer_ = reinterpret_cast<Type*>(buffer_->mutable_data());
+ }
+
+ void Append(const Type& value) {
+ if (ARROW_PREDICT_FALSE(size_ == capacity_)) {
+ capacity_ *= 2;
+ ARROW_CHECK_OK(buffer_->Resize(capacity_ * sizeof(Type)));
Review comment:
As noted before this is a void method, so I cannot return the status here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-965595707
> That could be an issue if the system is memory constrained relative to the size of the csv
More generally though, more allocations and more memory writing can increase the impact on the CPU cache and memory subsystem, even if lots of RAM are still available.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942799576
I think I figured out the issue. It was the overhead of checking the buffer bounds on every append. I changed it back to not checking bounds for the fixed size buffer and numbers look much better
```
----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 174105 ns 174102 ns 4012 bytes_per_second=931.204M/s
ChunkCSVEscapedBlock 143420 ns 143417 ns 4863 bytes_per_second=1063.95M/s
ChunkCSVNoNewlinesBlock 179 ns 179 ns 4276522 bytes_per_second=0/s
ParseCSVQuotedBlock 286099 ns 286084 ns 2434 bytes_per_second=566.703M/s
ParseCSVEscapedBlock 267409 ns 267403 ns 2612 bytes_per_second=570.629M/s
ParseCSVFlightsExample 2238658 ns 2238612 ns 310 bytes_per_second=433.999M/s
ParseCSVVehiclesExample 16297590 ns 16297154 ns 43 bytes_per_second=703.677M/s
ParseCSVStocksExample 3236563 ns 3236484 ns 216 bytes_per_second=648.39M/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-959217878
Any opinion on this @pitrou or @westonpace ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world edited a comment on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world edited a comment on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367
> Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
I was getting some wide variations between runs but these are the best numbers I got for master and this branch
master:
` ----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 168985 ns 168981 ns 4117 bytes_per_second=959.423M/s
ChunkCSVEscapedBlock 155557 ns 155555 ns 4460 bytes_per_second=980.924M/s
ChunkCSVNoNewlinesBlock 147 ns 147 ns 4741333 bytes_per_second=0/s
ParseCSVQuotedBlock 263473 ns 263469 ns 2646 bytes_per_second=615.346M/s
ParseCSVEscapedBlock 209135 ns 209132 ns 3362 bytes_per_second=729.624M/s
ParseCSVFlightsExample 2175256 ns 2175243 ns 320 bytes_per_second=446.642M/s
ParseCSVVehiclesExample 15967256 ns 15967111 ns 44 bytes_per_second=718.222M/s
ParseCSVStocksExample 3463566 ns 3463298 ns 203 bytes_per_second=605.926M/s`
This branch:
` ----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 169009 ns 169006 ns 4093 bytes_per_second=959.283M/s
ChunkCSVEscapedBlock 156445 ns 156443 ns 4467 bytes_per_second=975.356M/s
ChunkCSVNoNewlinesBlock 149 ns 149 ns 4749759 bytes_per_second=0/s
ParseCSVQuotedBlock 369561 ns 369551 ns 1882 bytes_per_second=438.707M/s
ParseCSVEscapedBlock 367681 ns 367671 ns 1867 bytes_per_second=415.012M/s
ParseCSVFlightsExample 2538161 ns 2538102 ns 278 bytes_per_second=382.788M/s
ParseCSVVehiclesExample 16641194 ns 16639585 ns 42 bytes_per_second=689.196M/s
ParseCSVStocksExample 3119450 ns 3119364 ns 224 bytes_per_second=672.734M/s`
No significant difference that I can see
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-918737367
> Sorry for the wait for feedback. Have you run the parsing benchmarks in `parser_benchmark.cc`? Does capturing the offset have any noticeable effect on performance?
I was getting some wide variations between runs but these are the best numbers I got for master and this branch
master:
> ----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 168985 ns 168981 ns 4117 bytes_per_second=959.423M/s
ChunkCSVEscapedBlock 155557 ns 155555 ns 4460 bytes_per_second=980.924M/s
ChunkCSVNoNewlinesBlock 147 ns 147 ns 4741333 bytes_per_second=0/s
ParseCSVQuotedBlock 263473 ns 263469 ns 2646 bytes_per_second=615.346M/s
ParseCSVEscapedBlock 209135 ns 209132 ns 3362 bytes_per_second=729.624M/s
ParseCSVFlightsExample 2175256 ns 2175243 ns 320 bytes_per_second=446.642M/s
ParseCSVVehiclesExample 15967256 ns 15967111 ns 44 bytes_per_second=718.222M/s
ParseCSVStocksExample 3463566 ns 3463298 ns 203 bytes_per_second=605.926M/s
This branch:
> ----------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
----------------------------------------------------------------------------------
ChunkCSVQuotedBlock 169009 ns 169006 ns 4093 bytes_per_second=959.283M/s
ChunkCSVEscapedBlock 156445 ns 156443 ns 4467 bytes_per_second=975.356M/s
ChunkCSVNoNewlinesBlock 149 ns 149 ns 4749759 bytes_per_second=0/s
ParseCSVQuotedBlock 369561 ns 369551 ns 1882 bytes_per_second=438.707M/s
ParseCSVEscapedBlock 367681 ns 367671 ns 1867 bytes_per_second=415.012M/s
ParseCSVFlightsExample 2538161 ns 2538102 ns 278 bytes_per_second=382.788M/s
ParseCSVVehiclesExample 16641194 ns 16639585 ns 42 bytes_per_second=689.196M/s
ParseCSVStocksExample 3119450 ns 3119364 ns 224 bytes_per_second=672.734M/s
No significant difference that I can see
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-1033983242
@n3world Sorry. Judging by the amount of complexity and runtime overhead it adds, I think it's not desirable indeed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] n3world commented on pull request #10662: ARROW-13252: [C++] Add offset to CSV error reporting
Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10662:
URL: https://github.com/apache/arrow/pull/10662#issuecomment-942316179
> Sorry for not answering earlier; but the degradation looks quite concerning indeed. A minor difference (e.g. 3-4%) would be ok, but we seem to be getting a 30% slowdown in some of these cases.
I agree that the performance change seems severe. I am not sure why adding ints to a second buffer would affect the performance that much. I also was getting wildly different values between runs on the same branch sometimes almost a 50% change so I am not sure how reliable any of those numbers really are. I can play with it some to see if I can figure out what is going on.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org