Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/02/18 16:05:12 UTC

[GitHub] Mister-Meeseeks opened a new pull request #23830: Decrease processing overhead on DataFrameReader CSV calls with specif…

Mister-Meeseeks opened a new pull request #23830: Decrease processing overhead on DataFrameReader CSV calls with specif…
URL: https://github.com/apache/spark/pull/23830
 
 
   …ied schema
   
   Prior to this patch, all DataFrameReader.csv() calls would collect the first
   line from the CSV input iterator. This is done to allow schema inference from the
   header row.
   
   However, when a schema is already specified, this is a wasteful operation. It results
   in an unnecessary compute step on the first partition, which can be expensive if
   the CSV itself is expensive to generate (e.g. it's the product of a long-running
   external pipe()).
   
   This patch short-circuits the first-line collection in DataFrameReader.csv() when
   a schema is specified, thereby improving CSV read performance in those cases.
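
The short-circuit described above can be illustrated with a minimal sketch in plain Python. This is not Spark code and not the patch itself: `read_csv`, `expensive_source`, and `source_runs` are hypothetical names invented for illustration, and the sketch assumes the input always starts with a header row. It only shows the general idea that peeking at the first line forces an eager evaluation of the source, which can be skipped when the caller already supplies the column names.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical counter: how many times the expensive source was actually started.
source_runs = {"count": 0}


def expensive_source() -> Iterator[str]:
    """Stand-in for a costly input (e.g. output of a long-running external process)."""
    source_runs["count"] += 1  # runs only when the generator is first advanced
    yield "id,name"
    yield "1,a"
    yield "2,b"


def read_csv(
    make_lines: Callable[[], Iterator[str]],
    schema: Optional[List[str]] = None,
) -> Tuple[List[str], Iterator[Dict[str, str]]]:
    """Hypothetical reader: returns (column names, lazy iterator of rows)."""
    if schema is not None:
        # Short-circuit: column names are known, so nothing is evaluated eagerly.
        columns = list(schema)
    else:
        # No schema: start the source once just to peek at the header line.
        first_line = next(iter(make_lines()), None)
        columns = first_line.split(",") if first_line else []

    def rows() -> Iterator[Dict[str, str]]:
        lines = iter(make_lines())
        next(lines, None)  # drop the header row (assumed to be present)
        for line in lines:
            yield dict(zip(columns, line.split(",")))

    return columns, rows()


# With a schema, building the reader does not touch the source at all.
columns, lazy_rows = read_csv(expensive_source, schema=["id", "name"])
print(source_runs["count"])
```

In this sketch the counter is still 0 after `read_csv` returns when a schema is passed, and the source runs only once the rows are iterated. Without a schema, the source is started once up front for the header peek and again when the rows are consumed.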
   
   
