You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/23 13:50:17 UTC

[GitHub] [arrow] n3world opened a new pull request #10790: ARROW-12673: Add callback to handle incorrect column counts

n3world opened a new pull request #10790:
URL: https://github.com/apache/arrow/pull/10790


   Add a new callback handler to the csv parser which is invoked when the column count in a row does not match the expected. The callback can return an enum which indicates if the row should result in an error result or it should be skipped.
   
   Replaces https://github.com/apache/arrow/pull/10202


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r706383355



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();
+        InvalidRow row{batch_.num_cols_, num_cols,
+                       first_row_ < 0 ? -1 : first_row_ + batch_row,
+                       util::string_view(start, end - start)};
+
+        if (options_.invalid_row_handler) {
+          if (options_.invalid_row_handler.value()(row) == InvalidRowResult::Skip) {
+            values_writer->RollbackLine();
+            parsed_writer->RollbackLine();
+            auto last_skip = batch_.skipped_rows_.rbegin();
+            if (last_skip == batch_.skipped_rows_.rend() ||
+                last_skip->second + 1 != batch_row) {
+              batch_.skipped_rows_.emplace_back(batch_row, batch_row);
+            } else {
+              last_skip->second = batch_row;
+            }

Review comment:
       @pitrou I changed it to just a vector of ints. Let me know if you like that better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-910358888


   Benchmark runs are scheduled for baseline = ff5572c845118c6228a896ef9223e81f660cda66 and contender = b2f9e0645a9a53a0df32ebe31e9ec1248e54c131. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1679bfd911764732925a77fea88b96ff...4728d6a0b09a4f88939b20a12f4d2065/)
   [Finished :arrow_down:0.73% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/93263b5dafdd4c8f9efc57d6fb26d16e...312491a87e604e5488121152ae5a2cd5/)
   [Finished :arrow_down:0.6% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/df83a1ab7e55452983ebfcc13b89b7c5...fd1a8d4acf884769b48a1a166d63a320/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700694257



##########
File path: cpp/src/arrow/csv/test_common.h
##########
@@ -47,7 +47,7 @@ ARROW_TESTING_EXPORT
 void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out);
 
 ARROW_TESTING_EXPORT
-Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, bool valid = true);
+Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, int invalid = 0);

Review comment:
       Changed it to a predicate function




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10790: ARROW-12673: Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-885653064


   https://issues.apache.org/jira/browse/ARROW-12673


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-910358888


   Benchmark runs are scheduled for baseline = ff5572c845118c6228a896ef9223e81f660cda66 and contender = b2f9e0645a9a53a0df32ebe31e9ec1248e54c131. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1679bfd911764732925a77fea88b96ff...4728d6a0b09a4f88939b20a12f4d2065/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/93263b5dafdd4c8f9efc57d6fb26d16e...312491a87e604e5488121152ae5a2cd5/)
   [Finished :arrow_down:0.6% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/df83a1ab7e55452983ebfcc13b89b7c5...fd1a8d4acf884769b48a1a166d63a320/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700563231



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();
+        InvalidRow row{batch_.num_cols_, num_cols,
+                       first_row_ < 0 ? -1 : first_row_ + batch_row,
+                       util::string_view(start, end - start)};
+
+        if (options_.invalid_row_handler) {
+          if (options_.invalid_row_handler.value()(row) == InvalidRowResult::Skip) {
+            values_writer->RollbackLine();
+            parsed_writer->RollbackLine();
+            auto last_skip = batch_.skipped_rows_.rbegin();
+            if (last_skip == batch_.skipped_rows_.rend() ||
+                last_skip->second + 1 != batch_row) {
+              batch_.skipped_rows_.emplace_back(batch_row, batch_row);
+            } else {
+              last_skip->second = batch_row;
+            }

Review comment:
       Just trying to reduce the potential overhead of a lot of skipped rows, but if this is over engineered I can just go to a simple list of skipped rows




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700693520



##########
File path: cpp/src/arrow/csv/options.h
##########
@@ -58,6 +61,8 @@ struct ARROW_EXPORT ParseOptions {
   /// Whether empty lines are ignored.  If false, an empty line represents
   /// a single empty value (assuming a one-column CSV file).
   bool ignore_empty_lines = true;
+  /// A handler function for rows which do not have the correct number of columns
+  util::optional<InvalidRowHandler> invalid_row_handler;

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700693595



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-910358036


   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10790:
URL: https://github.com/apache/arrow/pull/10790


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-910358888


   Benchmark runs are scheduled for baseline = ff5572c845118c6228a896ef9223e81f660cda66 and contender = b2f9e0645a9a53a0df32ebe31e9ec1248e54c131. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1679bfd911764732925a77fea88b96ff...4728d6a0b09a4f88939b20a12f4d2065/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/93263b5dafdd4c8f9efc57d6fb26d16e...312491a87e604e5488121152ae5a2cd5/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/df83a1ab7e55452983ebfcc13b89b7c5...fd1a8d4acf884769b48a1a166d63a320/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-909720593


   @pitrou does this implementation seem better than the previous?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700693691



##########
File path: cpp/src/arrow/csv/parser.h
##########
@@ -61,25 +62,41 @@ class ARROW_EXPORT DataBatch {
   int32_t num_cols() const { return num_cols_; }
   /// \brief Return the total size in bytes of parsed data
   uint32_t num_bytes() const { return parsed_size_; }
+  /// \brief Return the total number of rows skipped
+  int32_t num_skipped_rows() const {
+    if (ARROW_PREDICT_TRUE(skipped_rows_.empty())) {
+      return 0;
+    }
+    int32_t skipped = 0;
+    for (const auto& skip_range : skipped_rows_) {
+      skipped += skip_range.second - skip_range.first + 1;
+    }
+    return skipped;
+  }
 
   template <typename Visitor>
   Status VisitColumn(int32_t col_index, int64_t first_row, Visitor&& visit) const {
     using detail::ParsedValueDesc;
 
-    int64_t row = first_row;
+    int32_t batch_row = 0;
     for (size_t buf_index = 0; buf_index < values_buffers_.size(); ++buf_index) {
       const auto& values_buffer = values_buffers_[buf_index];
       const auto values = reinterpret_cast<const ParsedValueDesc*>(values_buffer->data());
       const auto max_pos =
           static_cast<int32_t>(values_buffer->size() / sizeof(ParsedValueDesc)) - 1;
-      for (int32_t pos = col_index; pos < max_pos; pos += num_cols_, ++row) {
+      for (int32_t pos = col_index; pos < max_pos; pos += num_cols_, ++batch_row) {
         auto start = values[pos].offset;
         auto stop = values[pos + 1].offset;
         auto quoted = values[pos + 1].quoted;
         Status status = visit(parsed_ + start, stop - start, quoted);
         if (ARROW_PREDICT_FALSE(!status.ok())) {
           if (first_row >= 0) {

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700563231



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();
+        InvalidRow row{batch_.num_cols_, num_cols,
+                       first_row_ < 0 ? -1 : first_row_ + batch_row,
+                       util::string_view(start, end - start)};
+
+        if (options_.invalid_row_handler) {
+          if (options_.invalid_row_handler.value()(row) == InvalidRowResult::Skip) {
+            values_writer->RollbackLine();
+            parsed_writer->RollbackLine();
+            auto last_skip = batch_.skipped_rows_.rbegin();
+            if (last_skip == batch_.skipped_rows_.rend() ||
+                last_skip->second + 1 != batch_row) {
+              batch_.skipped_rows_.emplace_back(batch_row, batch_row);
+            } else {
+              last_skip->second = batch_row;
+            }

Review comment:
       Trying to reduce the potential overhead of a lot of skipped rows and it makes the row number calculation in DataBatch::VisitColumn a little easier because I am able to add the skipped rows as ranges instead of individually.
   
   But if you prefer the list of skipped rows I can change it to that and change how the batch_row is calculated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700473094



##########
File path: cpp/src/arrow/csv/test_common.h
##########
@@ -47,7 +47,7 @@ ARROW_TESTING_EXPORT
 void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out);
 
 ARROW_TESTING_EXPORT
-Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, bool valid = true);
+Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, int invalid = 0);

Review comment:
       `size_t num_invalid_rows` perhaps?

##########
File path: cpp/src/arrow/csv/options.h
##########
@@ -58,6 +61,8 @@ struct ARROW_EXPORT ParseOptions {
   /// Whether empty lines are ignored.  If false, an empty line represents
   /// a single empty value (assuming a one-column CSV file).
   bool ignore_empty_lines = true;
+  /// A handler function for rows which do not have the correct number of columns
+  util::optional<InvalidRowHandler> invalid_row_handler;

Review comment:
       Note that `util::optional` is probably not necessary: a `std::function` can be "empty": see https://en.cppreference.com/w/cpp/utility/functional/function/operator_bool

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();
+        InvalidRow row{batch_.num_cols_, num_cols,
+                       first_row_ < 0 ? -1 : first_row_ + batch_row,
+                       util::string_view(start, end - start)};
+
+        if (options_.invalid_row_handler) {
+          if (options_.invalid_row_handler.value()(row) == InvalidRowResult::Skip) {
+            values_writer->RollbackLine();
+            parsed_writer->RollbackLine();
+            auto last_skip = batch_.skipped_rows_.rbegin();
+            if (last_skip == batch_.skipped_rows_.rend() ||
+                last_skip->second + 1 != batch_row) {
+              batch_.skipped_rows_.emplace_back(batch_row, batch_row);
+            } else {
+              last_skip->second = batch_row;
+            }

Review comment:
       Why not keep things simple and store all skipped row numbers?

##########
File path: cpp/src/arrow/csv/parser.h
##########
@@ -61,25 +62,41 @@ class ARROW_EXPORT DataBatch {
   int32_t num_cols() const { return num_cols_; }
   /// \brief Return the total size in bytes of parsed data
   uint32_t num_bytes() const { return parsed_size_; }
+  /// \brief Return the total number of rows skipped
+  int32_t num_skipped_rows() const {
+    if (ARROW_PREDICT_TRUE(skipped_rows_.empty())) {
+      return 0;
+    }
+    int32_t skipped = 0;
+    for (const auto& skip_range : skipped_rows_) {
+      skipped += skip_range.second - skip_range.first + 1;
+    }
+    return skipped;
+  }
 
   template <typename Visitor>
   Status VisitColumn(int32_t col_index, int64_t first_row, Visitor&& visit) const {
     using detail::ParsedValueDesc;
 
-    int64_t row = first_row;
+    int32_t batch_row = 0;
     for (size_t buf_index = 0; buf_index < values_buffers_.size(); ++buf_index) {
       const auto& values_buffer = values_buffers_[buf_index];
       const auto values = reinterpret_cast<const ParsedValueDesc*>(values_buffer->data());
       const auto max_pos =
           static_cast<int32_t>(values_buffer->size() / sizeof(ParsedValueDesc)) - 1;
-      for (int32_t pos = col_index; pos < max_pos; pos += num_cols_, ++row) {
+      for (int32_t pos = col_index; pos < max_pos; pos += num_cols_, ++batch_row) {
         auto start = values[pos].offset;
         auto stop = values[pos + 1].offset;
         auto quoted = values[pos + 1].quoted;
         Status status = visit(parsed_ + start, stop - start, quoted);
         if (ARROW_PREDICT_FALSE(!status.ok())) {
           if (first_row >= 0) {

Review comment:
       Perhaps move the error handling here in a dedicated method?

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,
-                                  first_row_ < 0 ? -1 : first_row_ + batch_.num_rows_,
-                                  util::string_view(start, end - start));
+        int32_t batch_row = batch_.num_rows_ + batch_.num_skipped_rows();

Review comment:
       `num_skipped_rows` is potentially O(n)... so, say you have one skipped row every two rows, parsing becomes O(n²)?

##########
File path: cpp/src/arrow/csv/test_common.h
##########
@@ -47,7 +47,7 @@ ARROW_TESTING_EXPORT
 void MakeColumnParser(std::vector<std::string> items, std::shared_ptr<BlockParser>* out);
 
 ARROW_TESTING_EXPORT
-Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, bool valid = true);
+Result<std::shared_ptr<Buffer>> MakeSampleCsvBuffer(size_t num_rows, int invalid = 0);

Review comment:
       Or instead perhaps pass an optional predicate: `std::function<bool(int row_num)> is_valid`.
   This will make the test values less magical.

##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,

Review comment:
       Since the code for handling an invalid row is becoming non-trivial, perhaps move it into a dedicated method?
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-909720593


   @pitrou does this implementation seem better than the previous?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#issuecomment-910358888


   Benchmark runs are scheduled for baseline = ff5572c845118c6228a896ef9223e81f660cda66 and contender = b2f9e0645a9a53a0df32ebe31e9ec1248e54c131. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1679bfd911764732925a77fea88b96ff...4728d6a0b09a4f88939b20a12f4d2065/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/93263b5dafdd4c8f9efc57d6fb26d16e...312491a87e604e5488121152ae5a2cd5/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/df83a1ab7e55452983ebfcc13b89b7c5...fd1a8d4acf884769b48a1a166d63a320/)
   Supported benchmarks:
   ursa-i9-9960x: langs = Python, R, JavaScript
   ursa-thinkcentre-m75q: langs = C++, Java
   ec2-t3-xlarge-us-east-2: cloud = True
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] n3world commented on a change in pull request #10790: ARROW-12673: [C++] Add callback to handle incorrect column counts

Posted by GitBox <gi...@apache.org>.
n3world commented on a change in pull request #10790:
URL: https://github.com/apache/arrow/pull/10790#discussion_r700693628



##########
File path: cpp/src/arrow/csv/parser.cc
##########
@@ -324,9 +324,29 @@ class BlockParserImpl {
         if (*(end - 1) == '\r') {
           --end;
         }
-        return MismatchingColumns(batch_.num_cols_, num_cols,

Review comment:
       Done




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org