You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/08/03 09:01:45 UTC

[GitHub] [arrow] cyb70289 opened a new pull request #10859: ARROW-13534: [C++] Improve csv chunker

cyb70289 opened a new pull request #10859:
URL: https://github.com/apache/arrow/pull/10859


   Add hints to help compiler deleting dead code per template augments.
   Tested with clang-10, gcc-9.3, on x86 and Arm, observed ~20% uplift for
   quoting=false benchmark. Probably due to better code locality.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#discussion_r681581313



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -63,12 +63,18 @@ class Lexer {
       case IN_FIELD:
         goto InField;
       case AT_ESCAPE:
+        // will never reach here if escaping = false
+        // just to hint the compiler to remove dead code
+        if (!escaping) return nullptr;
         goto AtEscape;
       case IN_QUOTED_FIELD:
+        if (!quoting) return nullptr;
         goto InQuotedField;

Review comment:
       Compiler is not able to figure out this code path is impossible when template augment `quoting` is `false`.
   Adding this hint makes compiler eliminating unnecessary dead code.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891711774


   Nice! Can the same technique be applied to the CSV parser?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891720190


   Didn't investigate deep but it looks easy for compiler to figure out the dead code in CSV parser.
   CSV chunker::ReadLine is "stateful", so it's harder for compiler.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891673429


   `ChunkCSVEscapedBlock` test (quoting=false) improves significantly.
   Below result is on xeon gold 5128. Similar result from Arm.
   
   $ release/arrow-csv-parser-benchmark  --benchmark_filter="ChunkCSV.*"
   
   **clang-10**
   Before
   ```
   ChunkCSVQuotedBlock         156626 ns       156625 ns         4483 bytes_per_second=1035.11M/s
   ChunkCSVEscapedBlock        271061 ns       271063 ns         2585 bytes_per_second=562.925M/s
   ChunkCSVNoNewlinesBlock        214 ns          214 ns      3278776 bytes_per_second=0/s
   ```
   After
   ```
   ChunkCSVQuotedBlock         156100 ns       156100 ns         4483 bytes_per_second=1038.6M/s
   ChunkCSVEscapedBlock        204164 ns       204164 ns         3423 bytes_per_second=747.381M/s
   ChunkCSVNoNewlinesBlock        214 ns          214 ns      3286563 bytes_per_second=0/s
   ```
   
   
   **gcc-9.3**
   Before
   ```
   ChunkCSVQuotedBlock         218153 ns       218151 ns         3208 bytes_per_second=743.176M/s
   ChunkCSVEscapedBlock        225251 ns       225251 ns         3107 bytes_per_second=677.414M/s
   ChunkCSVNoNewlinesBlock        202 ns          202 ns      3467515 bytes_per_second=0/s
   ```
   After
   ```
   ChunkCSVQuotedBlock         222775 ns       222776 ns         3141 bytes_per_second=727.747M/s
   ChunkCSVEscapedBlock        186540 ns       186541 ns         3751 bytes_per_second=817.984M/s
   ChunkCSVNoNewlinesBlock        202 ns          202 ns      3469361 bytes_per_second=0/s
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on a change in pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on a change in pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#discussion_r681592003



##########
File path: cpp/src/arrow/csv/chunker.cc
##########
@@ -63,12 +63,18 @@ class Lexer {
       case IN_FIELD:
         goto InField;
       case AT_ESCAPE:
+        // will never reach here if escaping = false
+        // just to hint the compiler to remove dead code
+        if (!escaping) return nullptr;
         goto AtEscape;
       case IN_QUOTED_FIELD:
+        if (!quoting) return nullptr;
         goto InQuotedField;

Review comment:
       The way I validate above assumption is a bit rough.
   I added below code snippet after [InQuotedField](https://github.com/apache/arrow/blob/de7cc1e4820e96531aae3b7f3fc9105f7b3a5f3a/cpp/src/arrow/csv/chunker.cc#L124)
   ```
   InQuotedField:
     if (!quoting) {
       __asm volatile ("pause" ::: "memory");
       __asm volatile ("pause" ::: "memory");
       __asm volatile ("pause" ::: "memory");
     }
   ```
   Logically, these three "pause" instruction should not be available.
   Without this patch, I can find the three continuous pause in disassembled libarrow.so. With this patch, they are gone.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10859:
URL: https://github.com/apache/arrow/pull/10859


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #10859: ARROW-13534: [C++] Improve csv chunker

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10859:
URL: https://github.com/apache/arrow/pull/10859#issuecomment-891669715


   https://issues.apache.org/jira/browse/ARROW-13534


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org