You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/06/27 02:48:50 UTC
[GitHub] [arrow] cyb70289 opened a new pull request, #13437: ARROW-16872: [C++] Fix csv parser edge case
cyb70289 opened a new pull request, #13437:
URL: https://github.com/apache/arrow/pull/13437
Fixes an error when there's no EOL at last line, and last field is consumed by bulk filter.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1166775245
https://issues.apache.org/jira/browse/ARROW-16872
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909420882
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ final_offset = static_cast<int32_t>(data_end - data);
Review Comment:
We could perhaps add a test for this fringe situation...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r911149899
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -337,6 +340,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ data = data_end;
+ }
Review Comment:
Hmm... I don't *think* so.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170890498
Perhaps ARROW_EXTRA_ERROR_CONTEXT is enabled?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r910612974
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ final_offset = static_cast<int32_t>(data_end - data);
Review Comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170898952
> Perhaps ARROW_EXTRA_ERROR_CONTEXT is enabled?
Thanks! `ARROW_EXTRA_ERROR_CONTEXT` is enabled in conda-cpp docker image. It should be the reason.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1168142905
@ursabot please benchmark command=cpp-micro --suite-filter="arrow-csv-*"
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909418536
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ final_offset = static_cast<int32_t>(data_end - data);
Review Comment:
I think the current code is incorrect in this situation: `HandleInvalidRow` is using `util::string_view(start, data - start)` as the invalid row span.
https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/parser.cc#L203-L215
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909345242
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ final_offset = static_cast<int32_t>(data_end - data);
Review Comment:
Can this simply be
```suggestion
data = data_end;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r909406531
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -296,6 +297,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ final_offset = static_cast<int32_t>(data_end - data);
Review Comment:
`data` is used in line https://github.com/apache/arrow/blob/master/cpp/src/arrow/csv/parser.cc#L387 if there are fewer rows than expected. Didn't read deep`HandleInvalidRow` code, will updating `data` cause problem?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] ursabot commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1168142919
Benchmark runs are scheduled for baseline = 49f26962456e11902621e41574bc5890205eac7a and contender = ad15fe1a7087ab7dc50d63069d1ad828e138f539. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped :warning: Only ['lang', 'name'] filters are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/47c0f5a500564250b4ac9f5667b526c0...8d31b8cd9c1347ee938d6fe18f3afd9c/)
[Skipped :warning: Only ['lang', 'name'] filters are supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/55fb43e538e64d63ae662ab3c6169c3a...7161074bf7df469bb52f1732ee318e12/)
[Skipped :warning: Only ['lang', 'name'] filters are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/19bc0247bd6e4b1c876549156d3b8823...7ed247e23a844784aff5024c4ad899a2/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/93c567ff9bec4d68b3932f5b837a41c8...e9445ac88513452cbc6d9cb4e38fea33/)
Buildkite builds:
[Scheduled] [`ad15fe1a` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/999)
[Finished] [`49f26962` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/988)
[Finished] [`49f26962` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/989)
[Failed] [`49f26962` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/978)
[Finished] [`49f26962` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/992)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on PR #13437:
URL: https://github.com/apache/arrow/pull/13437#issuecomment-1170886198
CI error is strange: https://github.com/apache/arrow/runs/7125162553?check_suite_focus=true#step:6:3704
```
[ RUN ] BlockParser.FinalTruncatedBulkFilterNoEol
/arrow/cpp/src/arrow/csv/parser_test.cc:423: Failure
Expected equality of these values:
("Invalid: CSV parse error: Expected 2 columns, got 1: 87654321")
Which is: "Invalid: CSV parse error: Expected 2 columns, got 1: 87654321"
_st.ToString()
Which is: "Invalid: CSV parse error: Expected 2 columns, got 1: 87654321\n/arrow/cpp/src/arrow/csv/parser.cc:464 (ParseLine<SpecializedOptions, true>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))\n/arrow/cpp/src/arrow/csv/parser.cc:566 ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)"
With diff:
@@ +1,3 @@
Invalid: CSV parse error: Expected 2 columns, got 1: 87654321
+/arrow/cpp/src/arrow/csv/parser.cc:464 (ParseLine<SpecializedOptions, true>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
+/arrow/cpp/src/arrow/csv/parser.cc:566 ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
[ FAILED ] BlockParser.FinalTruncatedBulkFilterNoEol (0 ms)
```
Looks like some gtest debug messages are appended to the status string.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou merged pull request #13437: ARROW-16872: [C++] Fix CSV parser edge case
Posted by GitBox <gi...@apache.org>.
pitrou merged PR #13437:
URL: https://github.com/apache/arrow/pull/13437
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a diff in pull request #13437: ARROW-16872: [C++] Fix csv parser edge case
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on code in PR #13437:
URL: https://github.com/apache/arrow/pull/13437#discussion_r910612092
##########
cpp/src/arrow/csv/parser.cc:
##########
@@ -337,6 +340,9 @@ class BlockParserImpl {
if (UseBulkFilter) {
const char* bulk_end = RunBulkFilter(parsed_writer, data, data_end, bulk_filter);
if (ARROW_PREDICT_FALSE(bulk_end == nullptr)) {
+ if (is_final) {
+ data = data_end;
+ }
Review Comment:
Is it an error if this happens in a quoted field?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org