You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/06 11:13:57 UTC

[GitHub] [arrow] cyb70289 opened a new pull request #12574: ARROW-15854: [C++] Refine CSV writer code

cyb70289 opened a new pull request #12574:
URL: https://github.com/apache/arrow/pull/12574


   - populate columns in normal order
   - support int64 row length


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 closed pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 closed pull request #12574:
URL: https://github.com/apache/arrow/pull/12574


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062445358


   Benchmark runs are scheduled for baseline = 8f901e02ba1d4cbc89bc0db725bfdc2e67510862 and contender = a5dc2526d30731c4cd8bfa253d28084622026ed7. a5dc2526d30731c4cd8bfa253d28084622026ed7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7c997ca9289b43798ac731f898c2d45f...da3ea2d14beb4f799522b28e130c5991/)
   [Finished :arrow_down:1.84% :arrow_up:1.72%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cd6de633cc3f4dc6920fa5d54fc908fc...1bd18a55b0d2426b8c2f23757239db91/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bc22f407b7af42428024b8ac53026909...34cfe2e59ffc4e8eb51bc1268e1e6462/)
   [Finished :arrow_down:0.34% :arrow_up:0.21%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1f2c4d6a2764f33b888de6b1a4ced8b...6d0f04761c6d4bf9b93bead20f99dc59/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062445358


   Benchmark runs are scheduled for baseline = 8f901e02ba1d4cbc89bc0db725bfdc2e67510862 and contender = a5dc2526d30731c4cd8bfa253d28084622026ed7. a5dc2526d30731c4cd8bfa253d28084622026ed7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7c997ca9289b43798ac731f898c2d45f...da3ea2d14beb4f799522b28e130c5991/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cd6de633cc3f4dc6920fa5d54fc908fc...1bd18a55b0d2426b8c2f23757239db91/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bc22f407b7af42428024b8ac53026909...34cfe2e59ffc4e8eb51bc1268e1e6462/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1f2c4d6a2764f33b888de6b1a4ced8b...6d0f04761c6d4bf9b93bead20f99dc59/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821311256



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -572,33 +558,34 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
     }
     // Calculate cumulative offsets for each row (including delimiters).
     // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    const int32_t delimiters_length =
         static_cast<int32_t>(batch.num_columns() - 1 + options_.eol.size());
-    offsets_[0] += delimiters_length;
-    for (int64_t row = 1; row < batch.num_rows(); row++) {
-      offsets_[row] += offsets_[row - 1] + delimiters_length;
+    int64_t last_row_length = offsets_[0] + delimiters_length;
+    offsets_[0] = 0;
+    for (size_t row = 1; row < offsets_.size(); ++row) {
+      const int64_t this_row_length = offsets_[row] + delimiters_length;
+      offsets_[row] = offsets_[row - 1] + last_row_length;
+      last_row_length = this_row_length;

Review comment:
       comment added




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062445358


   Benchmark runs are scheduled for baseline = 8f901e02ba1d4cbc89bc0db725bfdc2e67510862 and contender = a5dc2526d30731c4cd8bfa253d28084622026ed7. a5dc2526d30731c4cd8bfa253d28084622026ed7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7c997ca9289b43798ac731f898c2d45f...da3ea2d14beb4f799522b28e130c5991/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cd6de633cc3f4dc6920fa5d54fc908fc...1bd18a55b0d2426b8c2f23757239db91/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bc22f407b7af42428024b8ac53026909...34cfe2e59ffc4e8eb51bc1268e1e6462/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1f2c4d6a2764f33b888de6b1a4ced8b...6d0f04761c6d4bf9b93bead20f99dc59/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820256307



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -136,15 +134,14 @@ class ColumnPopulator {
   MemoryPool* const pool_;
 };
 
-// Copies the contents of to out properly escaping any necessary characters.
-// Returns the position prior to last copied character (out_end is decremented).
-char* EscapeReverse(arrow::util::string_view s, char* out_end) {
-  for (const char* val = s.data() + s.length() - 1; val >= s.data(); val--, out_end--) {
-    if (*val == '"') {
-      *out_end = *val;
-      out_end--;
+// Copies the contents of s to out properly escaping any necessary characters.
+// Returns the position next to last copied character.
+char* Escape(arrow::util::string_view s, char* out_end) {
+  for (const char c : s) {

Review comment:
       It occurs to me that performance here might be improved by using higher level methods to find the next escape character and memcpy in between.  Probably depends on string lengths being processed.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821278922



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -309,40 +300,33 @@ class QuotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     auto needs_escaping = row_needs_escaping_.begin();
     VisitArrayDataInline<StringType>(
         *(casted_array_->data()),
         [&](arrow::util::string_view s) {
           // still needs string content length to be added
-          char* row_end = output + *offsets;
-          int32_t next_column_offset = 0;
+          char* row = output + *offsets;
+          *row++ = '"';
           if (!*needs_escaping) {
-            next_column_offset = static_cast<int32_t>(s.length() + kQuoteDelimiterCount);
-            memcpy(row_end - next_column_offset + /*quote_offset=*/1, s.data(),
-                   s.length());
+            memcpy(row, s.data(), s.length());
+            row += s.length();
           } else {
-            // Adjust row_end by 2 + end_chars_.size(): 1 quote char, end_chars_.size()
-            // and 1 to position at the first position to write to.
-            next_column_offset = static_cast<int32_t>(
-                row_end - EscapeReverse(s, row_end - 2 - end_chars_.size()));
+            row = Escape(s, row);
           }
-          *(row_end - next_column_offset) = '"';
-          *(row_end - end_chars_.size() - 1) = '"';
-          memcpy(row_end - end_chars_.size(), end_chars_.data(), end_chars_.length());
-          *offsets -= next_column_offset;
+          *row++ = '"';
+          memcpy(row, end_chars_.data(), end_chars_.length());
+          row += end_chars_.length();
+          *offsets += static_cast<int64_t>(row - (output + *offsets));

Review comment:
       Yeah




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820339520



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -572,33 +558,34 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
     }
     // Calculate cumulative offsets for each row (including delimiters).
     // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    const int32_t delimiters_length =
         static_cast<int32_t>(batch.num_columns() - 1 + options_.eol.size());
-    offsets_[0] += delimiters_length;
-    for (int64_t row = 1; row < batch.num_rows(); row++) {
-      offsets_[row] += offsets_[row - 1] + delimiters_length;
+    int64_t last_row_length = offsets_[0] + delimiters_length;
+    offsets_[0] = 0;
+    for (size_t row = 1; row < offsets_.size(); ++row) {
+      const int64_t this_row_length = offsets_[row] + delimiters_length;
+      offsets_[row] = offsets_[row - 1] + last_row_length;
+      last_row_length = this_row_length;

Review comment:
       `offsets_[i]` was the length of row `i` before the conversion.
   After the conversion, `offsets_[i]` is the start position of row `i` in output buffer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059944004


   Benchmark runs are scheduled for baseline = a13870e33eb8b25a1e9cee28a310c7c6cd9a4fb4 and contender = a360344a15ae6cdabef60a2190e2cfeb55c52c76. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/86d347e07f004445908e47ef2a497020...5a487a36834549e5b048e1a51878029f/)
   [Finished :arrow_down:3.56% :arrow_up:3.81%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/01fda4daaae649b790899e54d666210c...30b1772b7ffa4c128ffe542f73463b4e/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9292568e6e4b4c1882d0564bf8ebc3e7...4ba1ac8d32754eb89f1b1dcf6af5f958/)
   [Finished :arrow_down:0.85% :arrow_up:0.13%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/05cdab4ca5ba44da896bc9997232878a...9b7c09d5b862469aba99d1c025f8d4e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821645141



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -571,34 +556,39 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
           column_populators_[col]->UpdateRowLengths(*batch.column(col), offsets_.data()));
     }
     // Calculate cumulative offsets for each row (including delimiters).
-    // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    // - before conversion: offsets_[i] = length of i-th row

Review comment:
       ```suggestion
       // - before conversion: offsets_[i] = length of (i-1)th row
   ```

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -571,34 +556,39 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
           column_populators_[col]->UpdateRowLengths(*batch.column(col), offsets_.data()));
     }
     // Calculate cumulative offsets for each row (including delimiters).
-    // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    // - before conversion: offsets_[i] = length of i-th row
+    // - after conversion:  offsets_[i] = offset to the starting of i-th row buffer
+    //   - offsets_[0] = 0
+    //   - offsets_[i] = offsets_[i-1] + len(i-1-th row) + len(delimiters)
+    // Delimiters: ',' * num_columns - 1(last column doesn't have ,) + eol

Review comment:
       ```suggestion
       // Delimiters: ',' * (num_columns - 1) + eol
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059943490


   Don't see obvious performance difference from this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059943990


   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820309351



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -136,15 +134,14 @@ class ColumnPopulator {
   MemoryPool* const pool_;
 };
 
-// Copies the contents of to out properly escaping any necessary characters.
-// Returns the position prior to last copied character (out_end is decremented).
-char* EscapeReverse(arrow::util::string_view s, char* out_end) {
-  for (const char* val = s.data() + s.length() - 1; val >= s.data(); val--, out_end--) {
-    if (*val == '"') {
-      *out_end = *val;
-      out_end--;
+// Copies the contents of s to out properly escaping any necessary characters.
+// Returns the position next to last copied character.
+char* Escape(arrow::util::string_view s, char* out_end) {
+  for (const char c : s) {

Review comment:
       Yes. It's the main reason I change the scan order. This function is the hotspot of string with quotes benchmark. Reverse scanning makes it hard to improve, the output string header pos is not known until end.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821286290



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -184,28 +181,23 @@ class UnquotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     // Function applied to valid values cast to string.
     auto valid_function = [&](arrow::util::string_view s) {
-      int32_t next_column_offset = static_cast<int32_t>(s.length() + end_chars_.size());
-      memcpy((output + *offsets - next_column_offset), s.data(), s.length());
-      memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
-             end_chars_.size());
-      *offsets -= next_column_offset;
+      memcpy(output + *offsets, s.data(), s.length());
+      memcpy(output + *offsets + s.length(), end_chars_.c_str(), end_chars_.size());
+      *offsets += static_cast<int64_t>(s.length() + end_chars_.size());
       offsets++;
       return Status::OK();
     };
 
     // Function applied to null values cast to string.
     auto null_function = [&]() {
       // For nulls, the configured null value string is copied into the output.
-      int32_t next_column_offset =
-          static_cast<int32_t>(null_string_->size() + end_chars_.size());
-      memcpy((output + *offsets - next_column_offset), null_string_->data(),
-             null_string_->size());
-      memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
+      memcpy(output + *offsets, null_string_->data(), null_string_->size());
+      memcpy(output + *offsets + null_string_->size(), end_chars_.c_str(),
              end_chars_.size());
-      *offsets -= next_column_offset;
+      *offsets += static_cast<int32_t>(null_string_->size() + end_chars_.size());

Review comment:
       changed

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -309,40 +300,33 @@ class QuotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     auto needs_escaping = row_needs_escaping_.begin();
     VisitArrayDataInline<StringType>(
         *(casted_array_->data()),
         [&](arrow::util::string_view s) {
           // still needs string content length to be added
-          char* row_end = output + *offsets;
-          int32_t next_column_offset = 0;
+          char* row = output + *offsets;
+          *row++ = '"';
           if (!*needs_escaping) {
-            next_column_offset = static_cast<int32_t>(s.length() + kQuoteDelimiterCount);
-            memcpy(row_end - next_column_offset + /*quote_offset=*/1, s.data(),
-                   s.length());
+            memcpy(row, s.data(), s.length());
+            row += s.length();
           } else {
-            // Adjust row_end by 2 + end_chars_.size(): 1 quote char, end_chars_.size()
-            // and 1 to position at the first position to write to.
-            next_column_offset = static_cast<int32_t>(
-                row_end - EscapeReverse(s, row_end - 2 - end_chars_.size()));
+            row = Escape(s, row);
           }
-          *(row_end - next_column_offset) = '"';
-          *(row_end - end_chars_.size() - 1) = '"';
-          memcpy(row_end - end_chars_.size(), end_chars_.data(), end_chars_.length());
-          *offsets -= next_column_offset;
+          *row++ = '"';
+          memcpy(row, end_chars_.data(), end_chars_.length());
+          row += end_chars_.length();
+          *offsets += static_cast<int64_t>(row - (output + *offsets));
           offsets++;
           needs_escaping++;
         },
         [&]() {
           // For nulls, the configured null value string is copied into the output.
-          int32_t next_column_offset =
-              static_cast<int32_t>(null_string_->size() + end_chars_.size());
-          memcpy((output + *offsets - next_column_offset), null_string_->data(),
-                 null_string_->size());
-          memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
+          memcpy(output + *offsets, null_string_->data(), null_string_->size());
+          memcpy(output + *offsets + null_string_->size(), end_chars_.c_str(),
                  end_chars_.size());
-          *offsets -= next_column_offset;
+          *offsets += static_cast<int32_t>(null_string_->size() + end_chars_.size());

Review comment:
       changed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820218686



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -117,17 +116,16 @@ class ColumnPopulator {
   }
 
   // Places string data onto each row in output and updates the corresponding row
-  // pointers in preparation for calls to other (preceding) ColumnPopulators.
+  // pointers in preparation for calls to other (next) ColumnPopulators.
   // Implementations may apply certain checks e.g. for illegal values, which in case of
   // failure causes this function to return an error Status.
   // Args:
   //   output: character buffer to write to.
-  //   offsets: an array of end of row column within the the output buffer (values are
-  //   one past the end of the position to write to).
-  virtual Status PopulateColumns(char* output, int32_t* offsets) const = 0;
+  //   offsets: an array of start of row column within the output buffer.
+  virtual Status PopulateRows(char* output, int64_t* offsets) const = 0;

Review comment:
       Name `PopulateColumns` is a bit confusing IMO as this function writes rows of a single column to csv output buffer.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059944004


   Benchmark runs are scheduled for baseline = a13870e33eb8b25a1e9cee28a310c7c6cd9a4fb4 and contender = a360344a15ae6cdabef60a2190e2cfeb55c52c76. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/86d347e07f004445908e47ef2a497020...5a487a36834549e5b048e1a51878029f/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/01fda4daaae649b790899e54d666210c...30b1772b7ffa4c128ffe542f73463b4e/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/9292568e6e4b4c1882d0564bf8ebc3e7...4ba1ac8d32754eb89f1b1dcf6af5f958/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/05cdab4ca5ba44da896bc9997232878a...9b7c09d5b862469aba99d1c025f8d4e4/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821276863



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -136,15 +134,14 @@ class ColumnPopulator {
   MemoryPool* const pool_;
 };
 
-// Copies the contents of to out properly escaping any necessary characters.
-// Returns the position prior to last copied character (out_end is decremented).
-char* EscapeReverse(arrow::util::string_view s, char* out_end) {
-  for (const char* val = s.data() + s.length() - 1; val >= s.data(); val--, out_end--) {
-    if (*val == '"') {
-      *out_end = *val;
-      out_end--;
+// Copies the contents of s to out properly escaping any necessary characters.
+// Returns the position next to last copied character.
+char* Escape(arrow::util::string_view s, char* out_end) {

Review comment:
       changed




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059942623


   https://issues.apache.org/jira/browse/ARROW-15854


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821729703



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -571,34 +556,39 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
           column_populators_[col]->UpdateRowLengths(*batch.column(col), offsets_.data()));
     }
     // Calculate cumulative offsets for each row (including delimiters).
-    // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    // - before conversion: offsets_[i] = length of i-th row

Review comment:
       Oops, my bad.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r821705177



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -571,34 +556,39 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
           column_populators_[col]->UpdateRowLengths(*batch.column(col), offsets_.data()));
     }
     // Calculate cumulative offsets for each row (including delimiters).
-    // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    // - before conversion: offsets_[i] = length of i-th row

Review comment:
       `offsets_[i]` is indeed the length of  `i-th` row before conversion, not `(i-1)th`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062445358


   Benchmark runs are scheduled for baseline = 8f901e02ba1d4cbc89bc0db725bfdc2e67510862 and contender = a5dc2526d30731c4cd8bfa253d28084622026ed7. a5dc2526d30731c4cd8bfa253d28084622026ed7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7c997ca9289b43798ac731f898c2d45f...da3ea2d14beb4f799522b28e130c5991/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cd6de633cc3f4dc6920fa5d54fc908fc...1bd18a55b0d2426b8c2f23757239db91/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bc22f407b7af42428024b8ac53026909...34cfe2e59ffc4e8eb51bc1268e1e6462/)
   [Finished :arrow_down:0.34% :arrow_up:0.21%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1f2c4d6a2764f33b888de6b1a4ced8b...6d0f04761c6d4bf9b93bead20f99dc59/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062440180


   CI errors not related


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1060122050


   > I'm slightly ambivalent on reversing the population direction, it seems border line like change just for the sake of change. I don't recall if the reverse population was necessary for some reason that has changed or just an overcomplicating on my part. If you think it makes things easier to understand I'll do a more a thorough review
   
   Thanks @emkornfield. I agree change just for the sake of change is not good. As I commented above, this change may make further optimization easier. And I do think it simplifies the code. Reading input strings forward continuously is also more friendly to prefetcher, though this workload is cpu bound.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1062445358


   Benchmark runs are scheduled for baseline = 8f901e02ba1d4cbc89bc0db725bfdc2e67510862 and contender = a5dc2526d30731c4cd8bfa253d28084622026ed7. a5dc2526d30731c4cd8bfa253d28084622026ed7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/7c997ca9289b43798ac731f898c2d45f...da3ea2d14beb4f799522b28e130c5991/)
   [Finished :arrow_down:1.84% :arrow_up:1.72%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/cd6de633cc3f4dc6920fa5d54fc908fc...1bd18a55b0d2426b8c2f23757239db91/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/bc22f407b7af42428024b8ac53026909...34cfe2e59ffc4e8eb51bc1268e1e6462/)
   [Finished :arrow_down:0.34% :arrow_up:0.21%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/a1f2c4d6a2764f33b888de6b1a4ced8b...6d0f04761c6d4bf9b93bead20f99dc59/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820852990



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -136,15 +134,14 @@ class ColumnPopulator {
   MemoryPool* const pool_;
 };
 
-// Copies the contents of to out properly escaping any necessary characters.
-// Returns the position prior to last copied character (out_end is decremented).
-char* EscapeReverse(arrow::util::string_view s, char* out_end) {
-  for (const char* val = s.data() + s.length() - 1; val >= s.data(); val--, out_end--) {
-    if (*val == '"') {
-      *out_end = *val;
-      out_end--;
+// Copies the contents of s to out properly escaping any necessary characters.
+// Returns the position next to last copied character.
+char* Escape(arrow::util::string_view s, char* out_end) {

Review comment:
       Change variable name to match new meaning?
   
   ```suggestion
   char* Escape(arrow::util::string_view s, char* out) {
   ```

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -184,28 +181,23 @@ class UnquotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     // Function applied to valid values cast to string.
     auto valid_function = [&](arrow::util::string_view s) {
-      int32_t next_column_offset = static_cast<int32_t>(s.length() + end_chars_.size());
-      memcpy((output + *offsets - next_column_offset), s.data(), s.length());
-      memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
-             end_chars_.size());
-      *offsets -= next_column_offset;
+      memcpy(output + *offsets, s.data(), s.length());
+      memcpy(output + *offsets + s.length(), end_chars_.c_str(), end_chars_.size());
+      *offsets += static_cast<int64_t>(s.length() + end_chars_.size());
       offsets++;
       return Status::OK();
     };
 
     // Function applied to null values cast to string.
     auto null_function = [&]() {
       // For nulls, the configured null value string is copied into the output.
-      int32_t next_column_offset =
-          static_cast<int32_t>(null_string_->size() + end_chars_.size());
-      memcpy((output + *offsets - next_column_offset), null_string_->data(),
-             null_string_->size());
-      memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
+      memcpy(output + *offsets, null_string_->data(), null_string_->size());
+      memcpy(output + *offsets + null_string_->size(), end_chars_.c_str(),
              end_chars_.size());
-      *offsets -= next_column_offset;
+      *offsets += static_cast<int32_t>(null_string_->size() + end_chars_.size());

Review comment:
       Why not `int64_t`?

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -309,40 +300,33 @@ class QuotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     auto needs_escaping = row_needs_escaping_.begin();
     VisitArrayDataInline<StringType>(
         *(casted_array_->data()),
         [&](arrow::util::string_view s) {
           // still needs string content length to be added
-          char* row_end = output + *offsets;
-          int32_t next_column_offset = 0;
+          char* row = output + *offsets;
+          *row++ = '"';
           if (!*needs_escaping) {
-            next_column_offset = static_cast<int32_t>(s.length() + kQuoteDelimiterCount);
-            memcpy(row_end - next_column_offset + /*quote_offset=*/1, s.data(),
-                   s.length());
+            memcpy(row, s.data(), s.length());
+            row += s.length();
           } else {
-            // Adjust row_end by 2 + end_chars_.size(): 1 quote char, end_chars_.size()
-            // and 1 to position at the first position to write to.
-            next_column_offset = static_cast<int32_t>(
-                row_end - EscapeReverse(s, row_end - 2 - end_chars_.size()));
+            row = Escape(s, row);
           }
-          *(row_end - next_column_offset) = '"';
-          *(row_end - end_chars_.size() - 1) = '"';
-          memcpy(row_end - end_chars_.size(), end_chars_.data(), end_chars_.length());
-          *offsets -= next_column_offset;
+          *row++ = '"';
+          memcpy(row, end_chars_.data(), end_chars_.length());
+          row += end_chars_.length();
+          *offsets += static_cast<int64_t>(row - (output + *offsets));
           offsets++;
           needs_escaping++;
         },
         [&]() {
           // For nulls, the configured null value string is copied into the output.
-          int32_t next_column_offset =
-              static_cast<int32_t>(null_string_->size() + end_chars_.size());
-          memcpy((output + *offsets - next_column_offset), null_string_->data(),
-                 null_string_->size());
-          memcpy((output + *offsets - end_chars_.size()), end_chars_.c_str(),
+          memcpy(output + *offsets, null_string_->data(), null_string_->size());
+          memcpy(output + *offsets + null_string_->size(), end_chars_.c_str(),
                  end_chars_.size());
-          *offsets -= next_column_offset;
+          *offsets += static_cast<int32_t>(null_string_->size() + end_chars_.size());

Review comment:
       `int64_t` perhaps?

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -572,33 +558,34 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
     }
     // Calculate cumulative offsets for each row (including delimiters).
     // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    const int32_t delimiters_length =
         static_cast<int32_t>(batch.num_columns() - 1 + options_.eol.size());
-    offsets_[0] += delimiters_length;
-    for (int64_t row = 1; row < batch.num_rows(); row++) {
-      offsets_[row] += offsets_[row - 1] + delimiters_length;
+    int64_t last_row_length = offsets_[0] + delimiters_length;
+    offsets_[0] = 0;
+    for (size_t row = 1; row < offsets_.size(); ++row) {
+      const int64_t this_row_length = offsets_[row] + delimiters_length;
+      offsets_[row] = offsets_[row - 1] + last_row_length;
+      last_row_length = this_row_length;

Review comment:
       Can you add a comment with this?

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -309,40 +300,33 @@ class QuotedColumnPopulator : public ColumnPopulator {
     return Status::OK();
   }
 
-  Status PopulateColumns(char* output, int32_t* offsets) const override {
+  Status PopulateRows(char* output, int64_t* offsets) const override {
     auto needs_escaping = row_needs_escaping_.begin();
     VisitArrayDataInline<StringType>(
         *(casted_array_->data()),
         [&](arrow::util::string_view s) {
           // still needs string content length to be added
-          char* row_end = output + *offsets;
-          int32_t next_column_offset = 0;
+          char* row = output + *offsets;
+          *row++ = '"';
           if (!*needs_escaping) {
-            next_column_offset = static_cast<int32_t>(s.length() + kQuoteDelimiterCount);
-            memcpy(row_end - next_column_offset + /*quote_offset=*/1, s.data(),
-                   s.length());
+            memcpy(row, s.data(), s.length());
+            row += s.length();
           } else {
-            // Adjust row_end by 2 + end_chars_.size(): 1 quote char, end_chars_.size()
-            // and 1 to position at the first position to write to.
-            next_column_offset = static_cast<int32_t>(
-                row_end - EscapeReverse(s, row_end - 2 - end_chars_.size()));
+            row = Escape(s, row);
           }
-          *(row_end - next_column_offset) = '"';
-          *(row_end - end_chars_.size() - 1) = '"';
-          memcpy(row_end - end_chars_.size(), end_chars_.data(), end_chars_.length());
-          *offsets -= next_column_offset;
+          *row++ = '"';
+          memcpy(row, end_chars_.data(), end_chars_.length());
+          row += end_chars_.length();
+          *offsets += static_cast<int64_t>(row - (output + *offsets));

Review comment:
       Wouldn't it be simpler as:
   ```suggestion
             *offsets = static_cast<int64_t>(row - output);
   ```

##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -572,33 +558,34 @@ class CSVWriterImpl : public ipc::RecordBatchWriter {
     }
     // Calculate cumulative offsets for each row (including delimiters).
     // ',' * num_columns - 1(last column doesn't have ,) + eol
-    int32_t delimiters_length =
+    const int32_t delimiters_length =
         static_cast<int32_t>(batch.num_columns() - 1 + options_.eol.size());
-    offsets_[0] += delimiters_length;
-    for (int64_t row = 1; row < batch.num_rows(); row++) {
-      offsets_[row] += offsets_[row - 1] + delimiters_length;
+    int64_t last_row_length = offsets_[0] + delimiters_length;
+    offsets_[0] = 0;
+    for (size_t row = 1; row < offsets_.size(); ++row) {
+      const int64_t this_row_length = offsets_[row] + delimiters_length;
+      offsets_[row] = offsets_[row - 1] + last_row_length;
+      last_row_length = this_row_length;

Review comment:
       Also, you probably mean: "`offsets_[i]` was the length of row `i-1` before the conversion"...?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on a change in pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
emkornfield commented on a change in pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#discussion_r820256069



##########
File path: cpp/src/arrow/csv/writer.cc
##########
@@ -117,17 +116,16 @@ class ColumnPopulator {
   }
 
   // Places string data onto each row in output and updates the corresponding row
-  // pointers in preparation for calls to other (preceding) ColumnPopulators.
+  // pointers in preparation for calls to other (next) ColumnPopulators.
   // Implementations may apply certain checks e.g. for illegal values, which in case of
   // failure causes this function to return an error Status.
   // Args:
   //   output: character buffer to write to.
-  //   offsets: an array of end of row column within the the output buffer (values are
-  //   one past the end of the position to write to).
-  virtual Status PopulateColumns(char* output, int32_t* offsets) const = 0;
+  //   offsets: an array of start of row column within the output buffer.
+  virtual Status PopulateRows(char* output, int64_t* offsets) const = 0;

Review comment:
       The new name seems fine to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] emkornfield commented on pull request #12574: ARROW-15854: [C++] Refine CSV writer code

Posted by GitBox <gi...@apache.org>.
emkornfield commented on pull request #12574:
URL: https://github.com/apache/arrow/pull/12574#issuecomment-1059993867


   I'm slightly ambivalent on reversing the population direction, it seems border line like change just for the sake of change.  I don't recall if the reverse population was necessary for some reason that has changed or just an overcomplicating on my part.   If you think it makes things easier to understand I'll do a more a thorough review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org