Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/03 00:34:01 UTC
[GitHub] [arrow] westonpace commented on issue #34010: [Python] Dataset Schema Infer Depth - Maximum Number of Rows
westonpace commented on issue #34010:
URL: https://github.com/apache/arrow/issues/34010#issuecomment-1414553376
It is not obvious, but it is possible. The `block_size` (in [`pyarrow.csv.ReadOptions`](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ReadOptions.html#pyarrow.csv.ReadOptions)) is what actually determines our inference depth. To specify custom read options you will need to create a [`pyarrow.dataset.CsvFileFormat`](https://arrow.apache.org/docs/python/generated/pyarrow.dataset.CsvFileFormat.html#pyarrow.dataset.CsvFileFormat).
Regrettably, the inference depth is always somewhat tied to our I/O performance, since the same `block_size` controls both. However, I suspect you can bump the default up quite a bit before you start to notice significant effects.
A complete example:
```
import pyarrow.csv as csv
import pyarrow.dataset as ds

MiB = 1024 * 1024
# The default block_size is 1 MiB; raising it increases the inference depth
read_options = csv.ReadOptions(block_size=16 * MiB)
csv_format = ds.CsvFileFormat(read_options=read_options)
my_dataset = ds.dataset('/tmp/my_dataset', format=csv_format)
print(my_dataset.to_table())
```