Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/03 00:34:01 UTC

[GitHub] [arrow] westonpace commented on issue #34010: [Python] Dataset Schema Infer Depth - Maximum Number of Rows

westonpace commented on issue #34010:
URL: https://github.com/apache/arrow/issues/34010#issuecomment-1414553376

   It is not obvious, but it is possible.  The `block_size` option in [`pyarrow.csv.ReadOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions) is what actually determines our inference depth.  To specify custom read options you will need to create a [`pyarrow.dataset.CsvFileFormat`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.CsvFileFormat.html#pyarrow.dataset.CsvFileFormat).
   
   Regrettably, the inference depth is always somewhat tied to our I/O performance: the same `block_size` controls both how many bytes are examined for type inference and how much data is read from each file at a time.  However, I suspect you can bump up the default quite a bit before you start to notice significant effects.
   
   A complete example:
   
   ```
   import pyarrow as pa
   import pyarrow.csv as csv
   import pyarrow.dataset as ds
   
   MiB = 1024*1024
   
   read_options = csv.ReadOptions(block_size=16*MiB)  # Note, the default is 1MiB
   csv_format = ds.CsvFileFormat(read_options=read_options)
   
   my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
   print(my_dataset.to_table())
   ```
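   
   If you want a self-contained way to check the result, the sketch below first writes a small CSV file and then opens it with the larger block size.  The path, file name, and column values are placeholder test data, not from the original issue:
   
   ```
   import os
   
   import pyarrow as pa
   import pyarrow.csv as csv
   import pyarrow.dataset as ds
   
   MiB = 1024*1024
   
   # Placeholder test data: one small CSV file in the dataset directory.
   os.makedirs('/tmp/my_dataset', exist_ok=True)
   csv.write_csv(pa.table({'a': [1, 2, 3], 'b': ['x', 'y', 'z']}),
                 '/tmp/my_dataset/part-0.csv')
   
   # Re-open the dataset with the larger block size and inspect the
   # schema that was inferred from the first block of each file.
   read_options = csv.ReadOptions(block_size=16*MiB)
   csv_format = ds.CsvFileFormat(read_options=read_options)
   my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
   print(my_dataset.schema)
   ```
   
   Printing `my_dataset.schema` (rather than materializing the whole table) is a cheap way to confirm how the columns were inferred.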

