Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/05/10 11:37:12 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #6325: Parallel CSV reading

alamb opened a new issue, #6325:
URL: https://github.com/apache/arrow-datafusion/issues/6325

   ### Is your feature request related to a problem or challenge?
   
   As part of having a great "out of the box" experience, it is important to use as many cores as possible in DataFusion. Given that modern consumer laptops have 8-16 cores, using multiple cores can translate to an order of magnitude faster performance.
   
   While DataFusion offers the ability to read partitioned datasets (i.e. when the input is split across multiple files), often, especially when initially testing out the tool, people will simply run queries on their existing CSV or JSON datasets, and those queries will be relatively slow.
   
   We already have the great `datafusion.optimizer.repartition_file_scans` option (see [docs](https://arrow.apache.org/datafusion/user-guide/configs.html)) -- added by @korowa in https://github.com/apache/arrow-datafusion/pull/5057 (👋 !) -- which uses multiple cores to decode Parquet files in parallel. I would like a similar feature for CSV files.
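   
   For reference, a minimal sketch of enabling the option from Rust, assuming the `SessionConfig` builder methods of this era (`with_repartition_file_scans`, `with_target_partitions`) and a hypothetical file name:
   
   ```rust
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       // Enable file-scan repartitioning and pick the desired parallelism.
       let config = SessionConfig::new()
           .with_repartition_file_scans(true)
           .with_target_partitions(8);
       let ctx = SessionContext::with_config(config);
   
       // Register a CSV file and run a query; with parallel reading the
       // scan would be split across the target partitions.
       ctx.register_csv("t", "data.csv", CsvReadOptions::new()).await?;
       ctx.sql("SELECT count(*) FROM t").await?.show().await?;
       Ok(())
   }
   ```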
   
   
   
   ### Describe the solution you'd like
   
   One basic approach (following what @korowa did for Parquet) would be:
   1. If the `datafusion.optimizer.repartition_file_scans` option is set, divide the file into contiguous, evenly sized byte ranges, probably with some lower limit on range size (like 1MB) -- see the sketch after this list
   2. Update [CsvExec](https://github.com/apache/arrow-datafusion/blob/8a25953182b361a58c56384129bf57f32aa2dbb1/datafusion/core/src/physical_plan/file_format/csv.rs#L53) to process partitions using those subsets of the files
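   
   A minimal sketch of the range-splitting step from item 1 (`split_into_ranges` is a hypothetical helper, not existing DataFusion code):
   
   ```rust
   use std::ops::Range;
   
   /// Split `file_len` bytes into at most `target_partitions` contiguous,
   /// roughly even ranges, never creating ranges smaller than `min_size`.
   fn split_into_ranges(file_len: u64, target_partitions: u64, min_size: u64) -> Vec<Range<u64>> {
       // Use fewer partitions if the file is too small to split evenly.
       let parts = target_partitions.min(file_len / min_size).max(1);
       let chunk = file_len / parts;
       (0..parts)
           .map(|i| {
               let start = i * chunk;
               // The last range absorbs the remainder of the division.
               let end = if i == parts - 1 { file_len } else { start + chunk };
               start..end
           })
           .collect()
   }
   
   fn main() {
       // A 100 MB file split across 4 cores with a 1 MB lower limit.
       let mb = 1024 * 1024;
       let ranges = split_into_ranges(100 * mb, 4, mb);
       assert_eq!(ranges.len(), 4);
       assert_eq!(ranges[0], 0..25 * mb);
       assert_eq!(ranges[3], 75 * mb..100 * mb);
   }
   ```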
   
   Notes:
   Given the vagaries of CSV (e.g. unescaped quoted newlines), it is likely impossible to parallelize CSV reading for all possible files. I think this is fine: as long as the parallel reading can be turned off, it is better to have faster out-of-the-box query performance for 99.99% of queries than to always handle bizarre CSV files.
   
   Care will be required to make sure all records are read exactly once, given the partition splits will likely be in the middle of rows. 
   
   One idea for parsing a partition (`offset`, `len`):
   1. Start CSV parsing *after* the first newline found at or beyond byte `offset` (the first partition, with `offset` 0, starts at byte 0)
   2. Continue CSV parsing through the first newline *after* byte `offset + len`
   
   ```
           0        A,1,2,3,4,5,6,7,8,9\n                            
           20       A,1,2,3,4,5,6,7,8,9\n                            
           40       A,1,2,3,4,5,6,7,8,9\n ◀─ ─ ─ ─ ─ ─ ─ ─           
           60       A,1,2,3,4,5,6,7,8,9\n                 │          
           80       A,1,2,3,4,5,6,7,8,9\n                            
           100      A,1,2,3,4,5,6,7,8,9\n                 │          
                                                                     
   Byte Offset       Lines of CSV Data                    │          
                     (in this case 20                                
                     bytes per line)           Split at byte 50 is in
                                                 the middle of this  
                                                        line         
                                                                     
   ```
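   
   A minimal sketch of that boundary adjustment over an in-memory buffer (hypothetical helpers; deliberately ignoring quoted newlines, which is exactly the case this scheme cannot handle):
   
   ```rust
   /// Byte position just past the first `\n` at or beyond `from`, or
   /// `data.len()` if no newline remains.
   fn after_next_newline(data: &[u8], from: usize) -> usize {
       let from = from.min(data.len());
       data[from..]
           .iter()
           .position(|&b| b == b'\n')
           .map(|i| from + i + 1)
           .unwrap_or(data.len())
   }
   
   /// Adjust a raw byte range (`offset`, `len`) to whole-row boundaries:
   /// start after the newline following `offset` (except for the first
   /// partition) and end after the newline following `offset + len`.
   fn row_aligned(data: &[u8], offset: usize, len: usize) -> (usize, usize) {
       let start = if offset == 0 {
           0
       } else {
           after_next_newline(data, offset)
       };
       let end = after_next_newline(data, offset + len);
       (start, end)
   }
   
   fn main() {
       // Six 20-byte rows, as in the diagram above (120 bytes total).
       let data = b"A,1,2,3,4,5,6,7,8,9\n".repeat(6);
       // Splitting at byte 50 lands mid-row; the row containing byte 50
       // belongs to the first partition, so the two partitions cover
       // rows 1-3 and rows 4-6, with each row read exactly once.
       assert_eq!(row_aligned(&data, 0, 50), (0, 60));
       assert_eq!(row_aligned(&data, 50, 70), (60, 120));
   }
   ```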
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   @kmitchener  noticed the same thing in: https://github.com/apache/arrow-datafusion/issues/5205
   
   The DuckDB implementation of a similar feature may offer some inspiration: https://github.com/duckdb/duckdb/pull/5194


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on issue #6325: Parallel CSV reading

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #6325:
URL: https://github.com/apache/arrow-datafusion/issues/6325#issuecomment-1543628064

   I believe this will require https://github.com/apache/arrow-rs/issues/2241, in particular the ability to do a streaming byte-range get. I will add this to my list
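   
   For context, a minimal sketch of what the `object_store` crate exposes today: a non-streaming `get_range` that buffers the whole range (the streaming variant is what the linked arrow-rs issue tracks); the path and byte range here are hypothetical:
   
   ```rust
   use object_store::{local::LocalFileSystem, path::Path, ObjectStore};
   
   #[tokio::main]
   async fn main() -> object_store::Result<()> {
       // Hypothetical local store rooted at the current directory.
       let store = LocalFileSystem::new_with_prefix(".")?;
       let location = Path::from("data.csv");
   
       // Today: fetch bytes 50..100 of the object in one buffered call.
       // A streaming variant would let the CSV decoder consume the range
       // incrementally instead of materializing it all first.
       let bytes = store.get_range(&location, 50..100).await?;
       println!("read {} bytes", bytes.len());
       Ok(())
   }
   ```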




[GitHub] [arrow-datafusion] alamb closed issue #6325: Parallel CSV reading

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #6325: Parallel CSV reading
URL: https://github.com/apache/arrow-datafusion/issues/6325

