Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2018/12/17 15:48:28 UTC

[GitHub] bersprockets opened a new pull request #23336: [SPARK-26378][SQL] Restore performance of queries against wide CSV tables

URL: https://github.com/apache/spark/pull/23336
 
 
   ## What changes were proposed in this pull request?
   
    After recent changes to CSV parsing to return partial results for bad CSV records, queries against wide CSV tables slowed considerably. That change caused every row to be recreated, even when the associated input record had no parsing issues and the user specified no corrupt record field in the schema.
   
    In this PR, I propose that a row be recreated only if there is a parsing error or if columns need to be shifted due to the presence of a corrupt record field in the user-supplied schema. Otherwise, the row is used as-is. This restores performance for the non-error case only (see the sketch below).
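
    A minimal sketch of the idea, with illustrative names only (`ParseResult`, `finishRow`, and `corruptFieldIndex` are assumptions for exposition, not the actual Spark CSV parser internals):

    ```scala
    // Illustrative sketch only: names and shapes are assumptions, not the
    // actual Spark CSV parser internals.
    object RowReuseSketch {

      /** Parsed values plus the indexes of any fields that failed to parse. */
      case class ParseResult(row: Array[Any], badFieldIndexes: Seq[Int])

      /**
       * Returns the output row, rebuilding it only when necessary: on a parse
       * error, or when the schema contains a corrupt-record column that forces
       * the parsed columns to be shifted.
       */
      def finishRow(
          result: ParseResult,
          corruptFieldIndex: Option[Int],
          rawRecord: => String): Array[Any] = {
        val parseFailed = result.badFieldIndexes.nonEmpty
        if (!parseFailed && corruptFieldIndex.isEmpty) {
          // Fast path: good record, no corrupt-record column; reuse the row.
          result.row
        } else {
          // Slow path: copy values, skipping the slot reserved for the
          // corrupt-record column, and fill that slot with the raw record
          // text when parsing failed.
          val extra = if (corruptFieldIndex.isDefined) 1 else 0
          val out = new Array[Any](result.row.length + extra)
          var from = 0
          var to = 0
          while (from < result.row.length) {
            if (corruptFieldIndex.contains(to)) to += 1
            out(to) = result.row(from)
            from += 1
            to += 1
          }
          corruptFieldIndex.foreach { i =>
            out(i) = if (parseFailed) rawRecord else null
          }
          out
        }
      }
    }
    ```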
   
    ### Benchmarks
   
    baseline = the commit before the partial results change
   PR = this PR
   master = master branch
   
    The wide table has 6,000 columns and 165,000 records; the narrow table has 12 columns and 82,500,000 records. Tests were run with a single executor.
   
    In the following, positive percentages are bad (slower than baseline) and negative percentages are good (faster).
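
    Each diff is computed relative to the baseline: diff = (time - baseline) / baseline * 100. For example, the PR diff in the first table below is (1.990344 - 2.036489) / 2.036489 * 100 ≈ -2.27%.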
   
   #### Wide rows, all good records:
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.036489 min | 1.990344 min | 2.952561 min | -2.265882% | 44.982923%
   
   #### Wide rows, all bad records
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   1.660761 min | 3.016839 min | 3.011944 min | 81.653994% | 81.359283%
   
    Both my PR and the master branch are ~81% slower than the baseline when all records are bad and no corrupt record field is specified in the schema. In fact, the master branch is consistently, though only slightly, faster than this PR here, since it does not call badRecord() in this case.
   
   #### Wide rows, corrupt record field, all good records
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.912467 min | 2.893039 min | 2.905344 min | -0.667056% | -0.244543%
   
   #### Wide rows, corrupt record field, all bad records
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.441417 min | 2.979544 min | 2.957439 min | 22.041620% | 21.136180%
   
    Both my PR and the master branch are ~21-22% slower than the baseline when all records are bad and the user specified a corrupt record field in the schema.
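
    For reference, the corrupt record field discussed above is the extra string column named by the `columnNameOfCorruptRecord` option in `PERMISSIVE` mode; a typical read looks like this (the path and schema are illustrative):

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().appName("csv-corrupt-record").getOrCreate()

    // User-supplied schema with an extra string column that captures the raw
    // text of any record that fails to parse.
    val schema = new StructType()
      .add("id", IntegerType)
      .add("name", StringType)
      .add("_corrupt_record", StringType)

    val df = spark.read
      .option("mode", "PERMISSIVE")                           // keep bad records
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .csv("/tmp/wide_table.csv")                             // illustrative path
    ```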
   
   #### Narrow rows, all good records
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.004539 min | 1.987183 min | 2.365122 min | -0.865813% | 17.988343%
   
   #### Narrow rows, corrupt record field, all good records
   
    baseline | PR | master | PR diff | master diff
   -----------|-----|-----------|-----------|---------------
   2.390589 min | 2.382100 min | 2.379733 min | -0.355096% | -0.454095%

    ## How was this patch tested?
   
    - All SQL unit tests
    - Python core and SQL tests
   
