Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/10 10:58:37 UTC

[GitHub] [arrow] vibhatha opened a new pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

vibhatha opened a new pull request #12112:
URL: https://github.com/apache/arrow/pull/12112


   This PR includes a minor documentation update showing how the `max_open_files`, `min_rows_per_group` and `max_rows_per_group` parameters can be used in the Python dataset API.
   
   The discussion on the issue: https://issues.apache.org/jira/browse/ARROW-15183
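
   For illustration, a minimal sketch of how these options can be passed to `pyarrow.dataset.write_dataset` (the table, column names, output directory, and values below are hypothetical placeholders, not taken from the PR):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical table; any Arrow table, batch, or dataset can be written.
table = pa.table({"year": [2020, 2021, 2022] * 2_000,
                  "value": [float(i) for i in range(6_000)]})

ds.write_dataset(
    table,
    "write_options_demo",      # hypothetical output directory
    format="parquet",
    max_open_files=900,        # cap on files left open during a partitioned write
    min_rows_per_group=1_000,  # buffer this many rows before flushing a row group
    max_rows_per_group=4_000,  # split larger batches into multiple row groups
)
```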


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831207882



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquest, Feather and IPC

Review comment:
       I think Feather and IPC are the same format.
   ```suggestion
   In addition row_groups are a factor which impacts write/read of Parquet and Feather/IPC
   ```
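
   As the comment notes, Feather (V2) is the Arrow IPC file format, so the row-group options apply to it the same way they do to Parquet. A hedged sketch, assuming a hypothetical table and output directory:

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"x": list(range(50_000))})   # hypothetical data

# "ipc" here is the Arrow IPC file format, i.e. Feather V2.
ds.write_dataset(
    table, "ipc_demo", format="ipc",
    min_rows_per_group=5_000,    # accumulate at least this many rows per batch
    max_rows_per_group=10_000,   # and split anything larger
)
```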







[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r836832020



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       Ah, I have no experience with Spark / S3 so that could be entirely true.  Maybe we could just change that sentence into "leading to out-of-memory errors in downstream readers that don't support partial-file reads"

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       Ah, I have no experience with Spark so that could be entirely true.  Maybe we could just change that sentence into "leading to out-of-memory errors in downstream readers that don't support partial-file reads"







[GitHub] [arrow] vibhatha commented on pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#issuecomment-1068587522


   @wjones127 I wasn't exactly sure how to avoid committing the changes to the test submodule. I will check this.





[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831709424



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       > As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.

   Are we assuming downstream readers are necessarily Arrow? I suggested that based on my experience with Spark, which, as I recall, reads whole files.

   > Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?

   In retrospect, that guidance is a bit low. My previous heuristic target was between 50 MB per file at a minimum and 2 GB as a maximum. That might be more specific to a Spark / S3 context, so maybe not as appropriate here.
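
   For reference, a sketch of using ``max_rows_per_file`` to cap file sizes (the table, row cap, and output directory are hypothetical; some versions also require ``max_rows_per_group`` to be no larger than ``max_rows_per_file``):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical table standing in for a larger write workload.
n = 100_000
table = pa.table({"id": pa.array(range(n), type=pa.int64()),
                  "value": pa.array([float(i) for i in range(n)])})

ds.write_dataset(
    table,
    "capped_files",             # hypothetical output directory
    format="parquet",
    max_rows_per_file=25_000,   # roughly four files for this table
    max_rows_per_group=25_000,  # keep row groups within the per-file cap
)
```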







[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831207882



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquest, Feather and IPC

Review comment:
       ```suggestion
   In addition row_groups are a factor which impacts write/read of Parquet, Feather and IPC
   ```







[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r836829845



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+

Review comment:
       Sorry I missed this.  This should help.  Multithreading does make the write_dataset call "jittery", but not completely random, so this would help with the small-files problem, though you might still get one or two small files here and there.
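
   To make the trade-off concrete, here is a sketch of a partitioned write where ``max_open_files`` balances open file handles against fragmentation (column names, values, and paths are hypothetical):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical table with a partition column that takes several distinct values.
table = pa.table({
    "country": ["ca", "us", "uk"] * 2_000,
    "value": [float(i) for i in range(6_000)],
})

ds.write_dataset(
    table,
    "partitioned_out",                 # hypothetical output directory
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("country", pa.string())]), flavor="hive"),
    max_open_files=512,   # below the 900 default; setting this too low fragments files
)
```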







[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r782597375



##########
File path: docs/source/python/dataset.rst
##########
@@ -699,3 +699,46 @@ Parquet files:
 
     # also clean-up custom base directory used in some examples
     shutil.rmtree(str(base), ignore_errors=True)
+
+
+Configuring files open during a write

Review comment:
       I think these sections would actually be best placed next to the "Partitioning performance considerations" section at https://github.com/apache/arrow/blame/master/docs/source/python/dataset.rst#L578
   
   

##########
File path: docs/source/python/dataset.rst
##########
@@ -699,3 +699,46 @@ Parquet files:
 
     # also clean-up custom base directory used in some examples
     shutil.rmtree(str(base), ignore_errors=True)
+
+
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+The number of files opened at during the write time can be set as follows;
+
+.. ipython:: python
+
+    ds.write_dataset(data=table, base_dir="data_dir", max_open_files=max_open_files)
+
+The maximum number of rows per file can be set as follows;
+
+.. ipython:: pythoin
+    ds.write_dataset(record_batch, "data_dir", format="parquet",

Review comment:
       I think that space between the directive and code is required.
   
   ```suggestion
   .. ipython:: python
   
       ds.write_dataset(record_batch, "data_dir", format="parquet",
   ```
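
   The suggested block above is cut off at the diff hunk boundary; one plausible shape for the full call, with a hypothetical ``record_batch`` and an illustrative row cap, would be:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical record batch; any Arrow table or batch works here.
record_batch = pa.RecordBatch.from_pydict({"x": list(range(4_096))})

ds.write_dataset(record_batch, "data_dir", format="parquet",
                 max_rows_per_file=1_024,
                 max_rows_per_group=1_024)
```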







[GitHub] [arrow] vibhatha commented on pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#issuecomment-1084745399


   > Let's delete that file size guidance for now. Otherwise I approve.
   
   @wjones127 updated. 





[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r826595344



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 

Review comment:
       This is another overlapping suggestion. I think the previous suggestion contained the expected idea. What do you think @wjones127? 







[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831240137



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquest, Feather and IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       I think it could use a little more direct advice to help users see the symptoms of when they've done something wrong. Here's my suggestion:
   
    > Row groups are built into the Parquet and IPC/Feather formats, but don't affect JSON or CSV. When reading back Parquet and IPC formats in Arrow, the row group boundaries become the record batch boundaries, determining the default batch size of downstream readers. Additionally, row groups in Parquet files have column statistics which can help readers skip irrelevant data but can add size to the file. As an extreme example, if one sets `max_rows_per_group=1` in Parquet, they will have large files because most of the file will be row group statistics. 
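
   A sketch of how that plays out in practice (paths and sizes are hypothetical): write with a row-group cap, then inspect the Parquet metadata to see the resulting groups, which also become the default batch boundaries when reading back.

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

table = pa.table({"x": list(range(10_000))})   # hypothetical data

ds.write_dataset(
    table, "row_group_demo", format="parquet",
    basename_template="part-{i}.parquet",
    min_rows_per_group=2_000,
    max_rows_per_group=2_000,
)

meta = pq.ParquetFile("row_group_demo/part-0.parquet").metadata
print(meta.num_row_groups)          # roughly 10_000 / 2_000 = 5
print(meta.row_group(0).num_rows)   # about 2_000
```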







[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r839715591



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,78 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the maximum number of open files allowed during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase the file handler limit on your
+system. The default value is 900 which allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       ```suggestion
   and how well compressed (if at all) the data is.
   ```







[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r827096582



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,72 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 

Review comment:
       ```suggestion
   important to optimize the writes, such as the number of rows per file and
   the number of files open during write. 
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed to respect 
+``max_open_files``. 
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+

Review comment:
       Those are good examples.
   
   Could you add a paragraph discussing how `row_groups` affect later reads for Parquet and Feather/IPC, but not CSV or JSON? 

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,72 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed to respect 
+``max_open_files``. This setting is the primary way to control file size. 
+For workloads writing a lot of data files can get very large without a 

Review comment:
       ```suggestion
   For workloads writing a lot of data, files can get very large without a
   ```







[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r826587938



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 

Review comment:
       @wjones127 this is a better description. Let me add it. 







[GitHub] [arrow] vibhatha commented on pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#issuecomment-1073017100


   @wjones127 I think this was a mistake from my end. Sorry about the confusion on committing the submodule. 
   I corrected it. 





[GitHub] [arrow] vibhatha commented on pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#issuecomment-1017096397


   @wjones127 Nice points. I will work on these ideas. 





[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r832122469



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquet and Feather/IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       I think this is a good explanation... 
   
   https://github.com/apache/arrow/pull/12112#discussion_r831240137







[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r830489459



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquest, Feather and IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       @wjones127 I added a small para on row-groups. Is this helpful? 
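
For anyone reading along, a minimal sketch of how these two options can be passed to `pyarrow.dataset.write_dataset` (the output path and the row counts below are illustrative, not recommendations):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Small in-memory table standing in for real data.
table = pa.table({"year": [2021, 2021, 2022], "value": [1.0, 2.0, 3.0]})

# Accumulate at least 1024 rows before flushing a row group, and never
# write a row group with more than 8192 rows.
ds.write_dataset(
    table,
    "example_dataset",  # hypothetical output directory
    format="parquet",
    min_rows_per_group=1024,
    max_rows_per_group=8192,
)
```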







[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831688034



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.

Review comment:
       ```suggestion
   the maximum number of open files allowed during the write.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files

Review comment:
       ```suggestion
   system. The default value is 900 which allows some number of files
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.

Review comment:
       Again, this setting isn't really meant to control the maximum number of files opened.

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.  Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
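
To illustrate the reading side (a hedged sketch; `sized_dataset` is a made-up path): scanning streams record batches whose size is roughly bounded by the row groups, so peak memory tracks row-group size rather than total file size:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("sized_dataset", format="parquet")  # hypothetical path

# to_batches() yields record batches instead of materializing whole files,
# so reasonably sized row groups keep memory use bounded.
for batch in dataset.to_batches():
    print(batch.num_rows)
```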

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.

Review comment:
       ```suggestion
   less than this value if other options such as ``max_open_files`` or 
   ``max_rows_per_file`` force smaller row group sizes.
   ```
   
   I think it is an error if `max_rows_per_file` is less than `min_rows_per_group`.
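
For example, a consistent configuration keeps `max_rows_per_file` at or above `min_rows_per_group` (an untested sketch; the path and row counts are made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": list(range(100_000)), "value": [0.0] * 100_000})

# max_rows_per_file is kept >= min_rows_per_group so every file can hold
# at least one full row group; the reverse ordering would be contradictory.
ds.write_dataset(
    table,
    "sized_dataset",  # hypothetical output directory
    format="parquet",
    max_rows_per_file=50_000,
    min_rows_per_group=10_000,
    max_rows_per_group=25_000,
)
```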

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 

Review comment:
       Can you reword this paragraph?  I'm having a hard time understanding the intent.  Also, I'm not sure I recognize the term "mini-batch" (at least, not in a way that would apply here).

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your

Review comment:
       ```suggestion
   ``max_open_files`` setting or increase the file handler limit on your
   ```
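
For instance, a partitioned write that leaves headroom under the system limit might look like this (a sketch; the partition column and the value 512 are illustrative):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({
    "year": [2021, 2021, 2022, 2022],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# One output directory per distinct "year" value; max_open_files caps how
# many of those files the writer keeps open at once.
ds.write_dataset(
    table,
    "partitioned_dataset",  # hypothetical output directory
    format="parquet",
    partitioning=["year"],
    partitioning_flavor="hive",
    max_open_files=512,
)
```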

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.

Review comment:
       `min_rows_per_group` isn't really intended to control the maximum number of files opened.  This might be confusing since we have `max_open_files`.

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquet and Feather/IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       What happens if the dataset is JSON or CSV and this is set?  Is it an error or is this property ignored?







[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r836838039



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,76 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the maximum number of open files allowed during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase the file handler limit on your
+system. The default value is 900 which allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The volume of data written to the disk per each group can be configured.
+This configuration includes a lower and an upper bound. 
+Set the minimum number of rows required to form a row group. 
+Defined with the ``min_rows_per_group`` parameter of :meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value if other options such as ``max_open_files`` or 
+``max_rows_per_file`` force smaller row group sizes.
+
+Set the maximum number of rows allowed per group. Defined as ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.

Review comment:
       ```suggestion
   The maximum number of rows allowed per group is defined with the
   ``max_rows_per_group`` parameter of :meth:`write_dataset`.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,76 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the maximum number of open files allowed during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase the file handler limit on your
+system. The default value is 900 which allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The volume of data written to the disk per each group can be configured.
+This configuration includes a lower and an upper bound. 
+Set the minimum number of rows required to form a row group. 
+Defined with the ``min_rows_per_group`` parameter of :meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value if other options such as ``max_open_files`` or 
+``max_rows_per_file`` force smaller row group sizes.
+
+Set the maximum number of rows allowed per group. Defined as ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 

Review comment:
       There are "[notes](https://sublime-and-sphinx-guide.readthedocs.io/en/latest/notes_warnings.html)" in RST format:
   
   ```
   .. note::
     
     This is the note content
   
   Resume regular content
   ```
   
   Perhaps we should use that instead of a regular paragraph here?

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,76 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the maximum number of open files allowed during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase the file handler limit on your
+system. The default value is 900 which allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The volume of data written to the disk per each group can be configured.
+This configuration includes a lower and an upper bound. 
+Set the minimum number of rows required to form a row group. 
+Defined with the ``min_rows_per_group`` parameter of :meth:`write_dataset`.

Review comment:
       ```suggestion
   This configuration includes a lower and an upper bound.
   
   The minimum number of rows required to form a row group is 
   defined with the ``min_rows_per_group`` parameter of :meth:`write_dataset`.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,76 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the maximum number of open files allowed during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase the file handler limit on your
+system. The default value is 900 which allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The volume of data written to the disk per each group can be configured.
+This configuration includes a lower and an upper bound. 
+Set the minimum number of rows required to form a row group. 
+Defined with the ``min_rows_per_group`` parameter of :meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value if other options such as ``max_open_files`` or 
+``max_rows_per_file`` force smaller row group sizes.
+
+Set the maximum number of rows allowed per group. Defined as ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+Row groups are built into the Parquet and IPC/Feather formats but don't affect JSON or CSV.

Review comment:
       ```suggestion
   row groups (e.g. if the incoming row group size is just barely larger than this value).
   
   Row groups are built into the Parquet and IPC/Feather formats but don't affect JSON or CSV.
   ```
   
   I think there is a "note" saying "make sure to set both properties" and then a paragraph talking about how row groups affect downstream readers.  We should, at a minimum, make them two different paragraphs (though we might also make the note a proper "note").







[GitHub] [arrow] westonpace commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r836834706



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+
+Set the maximum number of files opened with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be 
+less than this value and other options such as ``max_open_files`` or 
+``max_rows_per_file`` lead to smaller row group sizes.
+
+Set the maximum number of files opened with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition row_groups are a factor which impacts write/read of Parquest, Feather and IPC
+formats. The main purpose of these formats are to provide high performance data structures
+for I/O operations on larger datasets. The row_group concept allows the write/read operations
+to be optimized and gather a defined number of rows at once and execute the I/O operation. 
+But row_groups are not integrated to support JSON or CSV formats. 

Review comment:
       No, I think this is probably ok. Thinking on it further, my guess is that the user would assume these properties are just plain ignored if writing CSV or JSON, which is (more or less) what happens. So I think this is clear enough.
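
If it helps future readers, that case would look something like the sketch below (untested; per the comment above, the row-group options have no visible effect on the CSV output):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# CSV has no row-group concept, so these two options are effectively
# ignored for this format.
ds.write_dataset(
    table,
    "csv_dataset",  # hypothetical output directory
    format="csv",
    min_rows_per_group=1024,
    max_rows_per_group=8192,
)
```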










[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r826599846



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed to respect 
+``max_open_files``. 
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+

Review comment:
       I guess we can think of a logging use case where online activity is monitored in windows (window aggregations) and summaries are logged by computing over those aggregated values. In such a scenario, depending on the accuracy required for the computation (if it is a learning task) and the performance constraints (execution time and memory), users should be able to tune this parameter. This could be an interesting blog article if we can demonstrate it.
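
To make that concrete, here is a rough, untested sketch of the windowed-logging idea (the batch contents, sizes, and output path are all invented for illustration): many small per-window summaries are handed to the writer, and `min_rows_per_group` lets it accumulate them into reasonably sized row groups:

```python
import pyarrow as pa
import pyarrow.dataset as ds

schema = pa.schema([("window_start", pa.int64()), ("event_count", pa.int64())])

# Stand-in for per-window summaries arriving batch by batch; in a real
# pipeline these would come from the window-aggregation step.
batches = [
    pa.record_batch(
        [pa.array([start], pa.int64()), pa.array([42], pa.int64())],
        schema=schema,
    )
    for start in range(0, 100_000, 100)  # 1000 one-row batches
]

# min_rows_per_group makes the writer buffer the one-row batches and flush
# row groups of at least 250 rows instead of one tiny group per batch.
ds.write_dataset(
    batches,
    "window_summaries",  # hypothetical output directory
    schema=schema,
    format="parquet",
    min_rows_per_group=250,
)
```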







[GitHub] [arrow] vibhatha commented on pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#issuecomment-1010765139


   @wjones127 thanks for the review, I will update the PR. 





[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r831709424



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, such as the number of rows per file and
+the number of files open during write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many  files then the least
+recently used file will be closed.  If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increasing your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_files``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.

Review comment:
       > As long as the user is creating multiple (reasonably sized) row groups we shouldn't get out-of-memory errors even if the file is very large.
   
   Are we assuming downstream readers are necessarily Arrow? I suggested that based on my experience with Spark, which, as I recall, reads whole files.
   
   > Also, what evidence do you have for "For most applications, it's best to keep file sizes below 1GB"?
   
   In retrospect, that guidance is a bit low. My previous heuristic target was between 50 MB per file at a minimum and 2 GB as a maximum. That might be more specific to a Spark / S3 context, so maybe it's not as appropriate here.







[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r832126926



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,77 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, there are a few parameters that can be
+important for optimizing the writes, such as the number of rows per file and
+the number of files open during the write.
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_open_files`` is set greater than 0 then this will limit the maximum
+number of files that can be left open. This only applies to writing partitioned
+datasets, where rows are dispatched to the appropriate file depending on their
+partition values. If an attempt is made to open too many files then the least
+recently used file will be closed. If this setting is set too low you may end
+up fragmenting your data into many small files.
+
+If your process is concurrently using other file handlers, either with a 
+dataset scanner or otherwise, you may hit a system file handler limit. For 
+example, if you are scanning a dataset with 300 files and writing out to
+900 files, the total of 1200 files may be over a system limit. (On Linux,
+this might be a "Too Many Open Files" error.) You can either reduce this
+``max_open_files`` setting or increase your file handler limit on your
+system. The default value is 900 which also allows some number of files
+to be open by the scanner before hitting the default Linux limit of 1024. 
+
+Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of rows written in each file with the ``max_rows_per_file``
+parameter of :meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one
+file will be created in each output directory unless files need to be closed to respect
+``max_open_files``. This setting is the primary way to control file size.
+For workloads writing a lot of data, files can get very large without a
+row count cap, leading to out-of-memory errors in downstream readers. The
+relationship between row count and file size depends on the dataset schema
+and how well compressed (if at all) the data is. For most applications,
+it's best to keep file sizes below 1GB.
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, the amount of data written per row group can be
+configured depending on the volume of data obtained (for example, in a
+mini-batch setting where records arrive batch by batch). This is controlled
+with a minimum and a maximum parameter.
+
+Set the minimum number of rows written per row group with the ``min_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``min_rows_per_group`` is set greater than 0 then this will cause the 
+dataset writer to batch incoming data and only write the row groups to the 
+disk when sufficient rows have accumulated. The final row group size may be
+less than this value, and other options such as ``max_open_files`` or
+``max_rows_per_file`` may also lead to smaller row group sizes.
+
+Set the maximum number of rows written per row group with the ``max_rows_per_group`` parameter of
+:meth:`write_dataset`.
+
+Note: if ``max_rows_per_group`` is set greater than 0 then the dataset writer may split 
+up large incoming batches into multiple row groups.  If this value is set then 
+``min_rows_per_group`` should also be set or else you may end up with very small 
+row groups (e.g. if the incoming row group size is just barely larger than this value).
+In addition, row groups are a factor that impacts reading and writing the Parquet, Feather and IPC
+formats. The main purpose of these formats is to provide high-performance data structures
+for I/O operations on larger datasets. The row group concept allows read and write operations
+to be optimized by gathering a defined number of rows at once before executing the I/O operation.
+Row groups are not supported by the JSON or CSV formats.

Review comment:
       This one is much better. I replaced my content with this. @westonpace should we enhance further about CSV and JSON?
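
   As a rough illustration of the two row group knobs described in the hunk above (values chosen only for the sketch, not as guidance):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   table = pa.table({"id": list(range(100_000)), "score": [0.5] * 100_000})

   ds.write_dataset(
       table,
       "row_group_example",
       format="parquet",
       min_rows_per_group=20_000,  # buffer incoming rows until at least this many accumulate
       max_rows_per_group=50_000,  # split larger batches so no row group exceeds this
   )
   ```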




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r786377629



##########
File path: docs/source/python/dataset.rst
##########
@@ -699,3 +699,46 @@ Parquet files:
 
     # also clean-up custom base directory used in some examples
     shutil.rmtree(str(base), ignore_errors=True)
+
+
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+The number of files opened during the write can be set as follows:
+
+.. ipython:: python
+
+    ds.write_dataset(data=table, base_dir="data_dir", max_open_files=max_open_files)
+
+The maximum number of rows per file can be set as follows;
+
+.. ipython:: python
+    ds.write_dataset(record_batch, "data_dir", format="parquet",

Review comment:
       I replaced this with the short description of the option name and its functionality (minor modification to the C++ docstring and added here)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] vibhatha commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
vibhatha commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r786377402



##########
File path: docs/source/python/dataset.rst
##########
@@ -699,3 +699,46 @@ Parquet files:
 
     # also clean-up custom base directory used in some examples
     shutil.rmtree(str(base), ignore_errors=True)
+
+
+Configuring files open during a write

Review comment:
       I moved the docs to this section. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wjones127 commented on a change in pull request #12112: ARROW-15183: [Python][Docs] Add Missing Dataset Write Options

Posted by GitBox <gi...@apache.org>.
wjones127 commented on a change in pull request #12112:
URL: https://github.com/apache/arrow/pull/12112#discussion_r788004898



##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` parameter of
+:meth:`write_dataset`.

Review comment:
       ```suggestion
   Set the maximum number of rows written in each file with the ``max_rows_per_files`` parameter of
   :meth:`write_dataset`.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+

Review comment:
       @westonpace does my understanding below sound correct? I know it's a little complicated with multi-threading.
   
   ```suggestion
   
   To mitigate the many-small-files problem caused by this limit, you can 
   also sort your data by the partition columns (assuming it is not already
   sorted). This ensures that files are usually closed after all data for
   their respective partition has been written.
   
   ```
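
   A hypothetical sketch of that suggestion from the Python side, sorting an in-memory table on the partition column before handing it to the writer (column names and limits are invented for the example):

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   import pyarrow.dataset as ds

   table = pa.table({
       "year": [2021, 2020, 2022, 2020, 2021],
       "value": [1.0, 2.0, 3.0, 4.0, 5.0],
   })

   # Sort on the partition column so each partition's rows arrive together and
   # its output file can be closed once that partition has been written out.
   sorted_table = table.take(pc.sort_indices(table, sort_keys=[("year", "ascending")]))

   ds.write_dataset(
       sorted_table,
       "sorted_write_example",
       format="parquet",
       partitioning=ds.partitioning(pa.schema([("year", pa.int64())]), flavor="hive"),
       max_open_files=8,
   )
   ```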

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed to respect 
+``max_open_files``. 

Review comment:
       ```suggestion
   ``max_open_files``. This setting is the primary way to control file size. 
   For workloads writing a lot of data files can get very large without a 
   row count cap, leading to out-of-memory errors in downstream readers. The 
   relationship between row count and file size depends on the dataset schema
   and how well compressed (if at all) the data is. For most applications,
   it's best to keep file sizes below 1GB.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 

Review comment:
       We should probably mention that this setting applies to partitioned datasets:
   
   ```suggestion
   If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
   number of files that can be left open. This only applies to writing partitioned
   datasets, where rows are dispatched to the appropriate file depending on their
   partition values. If an attempt is made to open too many  files then the least
   recently used file will be closed.  If this setting is set 
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 

Review comment:
       ```suggestion
   Another important configuration used in :meth:`write_dataset` is ``max_rows_per_file``.
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 

Review comment:
       If we can, let's eliminate "Modify this value depending on the nature of write operations associated with the usage" and replace with more specific advice.
   
   ```suggestion
   If your process is concurrently using other file handlers, either with a 
   dataset scanner or otherwise, you may hit a system file handler limit. For 
   example, if you are scanning a dataset with 300 files and writing out to
   900 files, the total of 1200 files may be over a system limit. (On Linux,
   this might be a "Too Many Open Files" error.) You can either reduce this
   ``max_open_files`` setting or increase your file handler limit on your
   system. The default value is 900 which also allows some number of files
   to be open by the scanner before hitting the default Linux limit of 1024. 
   ```

##########
File path: docs/source/python/dataset.rst
##########
@@ -613,6 +613,60 @@ guidelines apply. Row groups can provide parallelism when reading and allow data
 based on statistics, but very small groups can cause metadata to be a significant portion
 of file size. Arrow's file writer provides sensible defaults for group sizing in most cases.
 
+Configuring files open during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to the disk, there are a few parameters that can be 
+important to optimize the writes, i.e number of rows per file and
+number of files open during write. 
+
+Set the maximum number of files opened with the ``max_open_files`` parameter of
+:meth:`write_dataset`.
+
+If  ``max_open_files`` is set greater than 0 then this will limit the maximum 
+number of files that can be left open. If an attempt is made to open too many 
+files then the least recently used file will be closed.  If this setting is set 
+too low you may end up fragmenting your data into many small files.
+
+The default value is 900 which also allows some number of files to be open 
+by the scannerbefore hitting the default Linux limit of 1024. Modify this value 
+depending on the nature of write operations associated with the usage. 
+
+Another important configuration used in `write_dataset` is ``max_rows_per_file``. 
+
+Set the maximum number of files opened with the ``max_rows_per_files`` parameter of
+:meth:`write_dataset`.
+
+If ``max_rows_per_file`` is set greater than 0 then this will limit how many 
+rows are placed in any single file. Otherwise there will be no limit and one 
+file will be created in each output directory unless files need to be closed to respect 
+``max_open_files``. 
+
+Configuring rows per group during a write
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When writing data to disk, depending on the volume of data obtained, 
+(in a mini-batch setting where, records are obtained in batch by batch)
+the volume of data written to disk per each group can be configured. 
+This can be configured using a minimum and maximum parameter. 
+

Review comment:
       A few points worth discussing:
   
    * Row groups matter for Parquet and Feather/IPC; they affect how data is seen by reader and because of row group statistics can affect file size.
    * Row groups are just batch size for CSV / JSON; the readers aren't affected.
   
   My impression is that we have reasonable defaults for these values, and users generally won't want to set these. Can you think of examples where we would recommend users adjust these values?
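
   One case where a user might reach for these settings is tuning Parquet row group granularity and then checking the result; a minimal sketch, with directory and value choices invented for the example:

   ```python
   import os

   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.parquet as pq

   table = pa.table({"id": list(range(100_000))})

   ds.write_dataset(
       table,
       "row_group_inspect",
       format="parquet",
       min_rows_per_group=25_000,
       max_rows_per_group=25_000,
   )

   # With no partitioning a single file lands in the base directory; its
   # metadata shows how many row groups the settings above produced.
   part_file = os.path.join("row_group_inspect", os.listdir("row_group_inspect")[0])
   print(pq.ParquetFile(part_file).metadata.num_row_groups)  # roughly 100_000 / 25_000
   ```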




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org