You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 09:01:45 UTC
[GitHub] [arrow] cyb70289 opened a new pull request #10859: ARROW-13534: [C++] Improve csv chunker
cyb70289 opened a new pull request #10859:
URL: https://github.com/apache/arrow/pull/10859
Add hints to help compiler deleting dead code per template augments.
Tested with clang-10, gcc-9.3, on x86 and Arm, observed ~20% uplift for
quoting=false benchmark. Probably due to better code locality.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a change in pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#discussion_r681581313
##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -63,12 +63,18 @@ class Lexer {
case IN_FIELD:
goto InField;
case AT_ESCAPE:
+ // will never reach here if escaping = false
+ // just to hint the compiler to remove dead code
+ if (!escaping) return nullptr;
goto AtEscape;
case IN_QUOTED_FIELD:
+ if (!quoting) return nullptr;
goto InQuotedField;
Review comment:
Compiler is not able to figure out this code path is impossible when template augment `quoting` is `false`.
Adding this hint makes compiler eliminating unnecessary dead code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891711774
Nice! Can the same technique be applied to the CSV parser?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891720190
Didn't investigate deep but it looks easy for compiler to figure out the dead code in CSV parser.
CSV chunker::ReadLine is "stateful", so it's harder for compiler.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891673429
`ChunkCSVEscapedBlock` test (quoting=false) improves significantly.
Below result is on xeon gold 5128. Similar result from Arm.
$ release/arrow-csv-parser-benchmark --benchmark_filter="ChunkCSV.*"
**clang-10**
Before
```
ChunkCSVQuotedBlock 156626 ns 156625 ns 4483 bytes_per_second=1035.11M/s
ChunkCSVEscapedBlock 271061 ns 271063 ns 2585 bytes_per_second=562.925M/s
ChunkCSVNoNewlinesBlock 214 ns 214 ns 3278776 bytes_per_second=0/s
```
After
```
ChunkCSVQuotedBlock 156100 ns 156100 ns 4483 bytes_per_second=1038.6M/s
ChunkCSVEscapedBlock 204164 ns 204164 ns 3423 bytes_per_second=747.381M/s
ChunkCSVNoNewlinesBlock 214 ns 214 ns 3286563 bytes_per_second=0/s
```
**gcc-9.3**
Before
```
ChunkCSVQuotedBlock 218153 ns 218151 ns 3208 bytes_per_second=743.176M/s
ChunkCSVEscapedBlock 225251 ns 225251 ns 3107 bytes_per_second=677.414M/s
ChunkCSVNoNewlinesBlock 202 ns 202 ns 3467515 bytes_per_second=0/s
```
After
```
ChunkCSVQuotedBlock 222775 ns 222776 ns 3141 bytes_per_second=727.747M/s
ChunkCSVEscapedBlock 186540 ns 186541 ns 3751 bytes_per_second=817.984M/s
ChunkCSVNoNewlinesBlock 202 ns 202 ns 3469361 bytes_per_second=0/s
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] cyb70289 commented on a change in pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#discussion_r681592003
##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -63,12 +63,18 @@ class Lexer {
case IN_FIELD:
goto InField;
case AT_ESCAPE:
+ // will never reach here if escaping = false
+ // just to hint the compiler to remove dead code
+ if (!escaping) return nullptr;
goto AtEscape;
case IN_QUOTED_FIELD:
+ if (!quoting) return nullptr;
goto InQuotedField;
Review comment:
The way I validate above assumption is a bit rough.
I added below code snippet after [InQuotedField](https://github.com/apache/arrow/blob/de7cc1e4820e96531aae3b7f3fc9105f7b3a5f3a/cpp/src/arrow/csv/chunker.cc#L124)
```
InQuotedField:
if (!quoting) {
__asm volatile ("pause" ::: "memory");
__asm volatile ("pause" ::: "memory");
__asm volatile ("pause" ::: "memory");
}
```
Logically, these three "pause" instruction should not be available.
Without this patch, I can find the three continuous pause in disassembled libarrow.so. With this patch, they are gone.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10859:
URL: https://github.com/apache/arrow/pull/10859
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891669715
https://issues.apache.org/jira/browse/ARROW-13534
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org