You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/27 02:48:50 UTC

[GitHub] [arrow] cyb70289 opened a new pull request, #13437: ARROW-16872: [C++] Fix csv parser edge case

cyb70289 opened a new pull request, #13437:
URL: https://github.com/apache/arrow/pull/13437

   Fixes an error when there's no EOL at last line, and last field is consumed by bulk filter.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1166775245

   https://issues.apache.org/jira/browse/ARROW-16872


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909420882


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          final_offset = static_cast<int32_t>(data_end - data);

Review Comment:
   We could perhaps add a test for this fringe situation...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r911149899


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -337,6 +340,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          data = data_end;
+        }

Review Comment:
   Hmm... I don't *think* so.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170890498

   Perhaps ARROW_EXTRA_ERROR_CONTEXT is enabled?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r910612974


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          final_offset = static_cast<int32_t>(data_end - data);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170898952

   > Perhaps ARROW_EXTRA_ERROR_CONTEXT is enabled?
   
   Thanks! `ARROW_EXTRA_ERROR_CONTEXT` is enabled in conda-cpp docker image. It should be the reason.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1168142905

   @ursabot please benchmark command=cpp-micro --suite-filter="arrow-csv-*"


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909418536


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          final_offset = static_cast<int32_t>(data_end - data);

Review Comment:
   I think the current code is incorrect in this situation: `HandleInvalidRow` is using `util::string_view(start, data - start)` as the invalid row span.
   https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/parser.cc#L203-L215



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909345242


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          final_offset = static_cast<int32_t>(data_end - data);

Review Comment:
   Can this simply be
   ```suggestion
             data = data_end;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909406531


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          final_offset = static_cast<int32_t>(data_end - data);

Review Comment:
   `data` is used in line https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/parser.cc#L387 if there are fewer rows than expected. Didn't read deep`HandleInvalidRow` code, will updating `data` cause problem?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1168142919

   Benchmark runs are scheduled for baseline = 49f26962456e11902621e41574bc5890205eac7a and contender = ad15fe1a7087ab7dc50d63069d1ad828e138f539. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/47c0f5a500564250b4ac9f5667b526c0...8d31b8cd9c1347ee938d6fe18f3afd9c/)
   [Skipped :warning: Only ['lang', 'name'] filters are supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/55fb43e538e64d63ae662ab3c6169c3a...7161074bf7df469bb52f1732ee318e12/)
   [Skipped :warning: Only ['lang', 'name'] filters are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/19bc0247bd6e4b1c876549156d3b8823...7ed247e23a844784aff5024c4ad899a2/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/93c567ff9bec4d68b3932f5b837a41c8...e9445ac88513452cbc6d9cb4e38fea33/)
   Buildkite builds:
   [Scheduled] [`ad15fe1a` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/999)
   [Finished] [`49f26962` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/988)
   [Finished] [`49f26962` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/989)
   [Failed] [`49f26962` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/978)
   [Finished] [`49f26962` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/992)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170886198

   CI error is strange: https://github.com/apache/arrow/runs/7125162553?check_suite_focus=true#step:6:3704
   ```
   [ RUN      ] BlockParser.FinalTruncatedBulkFilterNoEol
   /arrow/cpp/src/arrow/csv/parser_test.cc:423: Failure
   Expected equality of these values:
     ("Invalid: CSV parse error: Expected 2 columns, got 1: 87654321")
       Which is: "Invalid: CSV parse error: Expected 2 columns, got 1: 87654321"
     _st.ToString()
       Which is: "Invalid: CSV parse error: Expected 2 columns, got 1: 87654321\n/arrow/cpp/src/arrow/csv/parser.cc:464  (ParseLine<SpecializedOptions, true>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))\n/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)"
   With diff:
   @@ +1,3 @@
    Invalid: CSV parse error: Expected 2 columns, got 1: 87654321
   +/arrow/cpp/src/arrow/csv/parser.cc:464  (ParseLine<SpecializedOptions, true>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
   +/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
   [  FAILED  ] BlockParser.FinalTruncatedBulkFilterNoEol (0 ms)
   ```
   Looks like some gtest debug messages are appended to the status string.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #13437: ARROW-16872: [C++] Fix CSV parser edge case

Posted by GitBox <gi...@apache.org>.
pitrou merged PR #13437:
URL: https://github.com/apache/arrow/pull/13437


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r910612092


##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -337,6 +340,9 @@ class BlockParserImpl {
     if (UseBulkFilter) {
       const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
       if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+        if (is_final) {
+          data = data_end;
+        }

Review Comment:
   Is it an error if this happens in a quoted field?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org