Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/22 01:32:54 UTC

[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831688034



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.

Review comment:
       ```suggestion
   the maximum number of open files allowed during the write.
   ```
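
For illustration only (not part of the review or the proposed docs text), a minimal sketch of passing ``max_open_files`` to ``pyarrow.dataset.write_dataset`` when writing a partitioned dataset; the table, output path, and partitioning schema below are made-up placeholders:

```python
# Illustrative sketch: cap the number of output files a partitioned write
# keeps open at once. Data and paths below are placeholders.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2020, 2021, 2021, 2022], "value": [1.0, 2.0, 3.0, 4.0]})

ds.write_dataset(
    table,
    "some_output_dir",  # placeholder output directory
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive"),
    max_open_files=900,  # past this limit the least recently used file is closed
)
```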

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files

Review comment:
       ```suggestion
   system. The default value is 900 which allows some number of files
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.

Review comment:
       Again, this setting isn't really meant to control the maximum number of files opened.

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.  Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
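
For illustration only (not from the PR), a minimal sketch of capping file size by row count with ``max_rows_per_file``; the row counts and path are placeholder values:

```python
# Illustrative sketch: limit how many rows land in any single output file.
# Data, path, and the chosen limits are placeholders.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": list(range(1_000_000))})

ds.write_dataset(
    table,
    "capped_output_dir",  # placeholder output directory
    format="parquet",
    max_rows_per_file=250_000,   # the write is split across multiple files
    max_rows_per_group=250_000,  # keep row groups no larger than a single file
)
```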

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.

Review comment:
       ```suggestion
   less than this value if other options such as ``max_open_files`` or 
   ``max_rows_per_file`` force smaller row group sizes.
   ```
   
   I think it is an error if `max_rows_per_file` is less than `min_rows_per_group`.
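
For illustration only (not from the PR), a minimal sketch combining the two row-group parameters; the values and path are placeholders chosen so that the minimum does not exceed the maximum:

```python
# Illustrative sketch: buffer incoming data into row groups of bounded size.
# Data, path, and the chosen limits are placeholders.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": list(range(200_000))})

ds.write_dataset(
    table,
    "grouped_output_dir",  # placeholder output directory
    format="parquet",
    min_rows_per_group=20_000,  # accumulate at least this many rows before writing a group
    max_rows_per_group=50_000,  # split larger incoming batches into multiple row groups
)
```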

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 

Review comment:
       Can you reword this paragraph?  I'm having a hard time understanding the intent.  Also, I'm not sure I recognize the term "mini-batch" (at least, not in a way that would apply here).

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your

Review comment:
       ```suggestion
   ``max_open_files`` setting or increase the file handler limit on your
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.

Review comment:
       `min_rows_per_group` isn't really intended to control the maximum number of files opened.  This might be confusing since we have `max_open_files`.

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquet and Feather/IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       What happens if the dataset is JSON or CSV and this is set?  Is it an error or is this property ignored?



