Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/07 14:50:28 UTC

[GitHub] [arrow] amol- opened a new pull request #10266: [Doc][Python] Improve documentation regarding dealing with memory mapped files

amol- opened a new pull request #10266:
URL: https://github.com/apache/arrow/pull/10266


   





[GitHub] [arrow] pitrou closed pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #10266:
URL: https://github.com/apache/arrow/pull/10266


   





[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r636903715



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       I rephrased it to make it clearer that in absolute terms you won't be consuming less memory, but the system will be able to page it out more easily.
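
A minimal sketch can make that distinction concrete, assuming `psutil` is installed and that the `bigfile.arrow` file from the quoted example exists: PyArrow's own allocation counter stays near zero for the memory-mapped read, while the process RSS still reflects whatever pages the operating system has faulted in (and can reclaim cheaply, since they are clean).

```python
# Illustrative sketch (not part of the PR): compare Arrow's allocator counter
# with the process RSS when reading a memory-mapped Arrow file.
# Assumes psutil is installed and 'bigfile.arrow' was written as in the example.
import psutil
import pyarrow as pa

process = psutil.Process()

with pa.memory_map('bigfile.arrow', 'r') as source:
    table = pa.ipc.open_file(source).read_all()
    # Arrow allocated (almost) nothing: the arrays reference the mapping.
    print("Arrow allocations: {} MB".format(pa.total_allocated_bytes() >> 20))
    # RSS still grows as pages are faulted in, but those pages are clean and
    # the OS can drop them without writing anything to swap.
    print("Process RSS:       {} MB".format(process.memory_info().rss >> 20))
```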







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r636905643



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       I don't have a strong opinion about using `ipython` blocks versus plain code blocks. I used them mostly for consistency with the rest of the document.
   
   I do think they have value in at least verifying that the example code actually executes (even though it might produce different results), which makes it easier to catch examples that become invalid due to API changes.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674825533



##########
File path: docs/source/python/parquet.rst
##########
@@ -112,6 +112,29 @@ In general, a Python file object will have the worst read performance, while a
 string file path or an instance of :class:`~.NativeFile` (especially memory
 maps) will perform the best.
 
+.. _parquet_mmap:
+
+Reading Parquet and Memory Mapping
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because Parquet data needs to be decoded from the parquet format 

Review comment:
       :+1: 







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657125159



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       Sorry for the delay. Basically, reading and writing Arrow arrays uses the IPC layer, so I do think this would be better placed in `ipc.rst`; `memory.rst` only deals with raw byte data.







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674841706



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       This is true, but the text above always talks about record batches specifically. That said, "Arrow data" would be fine with me (bare "data" is too general, IMHO).







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r631870514



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, since it can point directly to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system, without any write-back cost, because it isn't flagged as dirty memory; that makes it possible to deal with files bigger than memory even in the absence of a swap file.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674000017



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile2.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for chunk in loaded_array[0].iterchunks():
+                batch = pa.record_batch([chunk], schema)
+                writer.write(batch)
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Most high level APIs like :meth:`~pyarrow.parquet.read_table` also provide a
+``memory_map`` option. But in those cases, the memory mapping can't help with
+reducing resident memory consumption. Because Parquet data needs to be decoded
+from the parquet format and compression, it can't be directly mapped from disk,
+thus the ``memory_map`` option might perform better on some systems but won't
+help much with resident memory consumption.

Review comment:
       Added a short note in the IPC section mentioning that a `memory_map` option may be available in other readers, and created a short section in the Parquet page that the note links to.
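
For readers of this thread, a hedged sketch of what that Parquet note describes (the file name `example.parquet` is only illustrative): `memory_map=True` changes how the raw file bytes are accessed, but the decoded, decompressed columns are still regular in-memory allocations.

```python
# Illustrative sketch (not from the PR): the memory_map option of the Parquet
# reader. 'example.parquet' is a made-up file name for this example.
import pyarrow as pa
import pyarrow.parquet as pq

pq.write_table(pa.table({'nums': pa.array(range(1000), type=pa.int32())}),
               'example.parquet')

# memory_map only affects how the raw Parquet bytes are read; decoding and
# decompression still allocate ordinary memory for the resulting columns.
loaded = pq.read_table('example.parquet', memory_map=True)
print(loaded.num_rows, pa.total_allocated_bytes())
```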







[GitHub] [arrow] amol- commented on pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#issuecomment-884216675


   @jorisvandenbossche @pitrou I think I did my best to address the remaining comments; could you take another look? :D





[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668616661



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       @jorisvandenbossche memory mapping only provides benefits if you don't alter the data. If you applied any transformation, the data would no longer match what is on disk, and you would lose all the benefits of memory mapping, because the kernel would no longer be able to use the memory-mapped file instead of the swap file when it needs to page out. I guess we can remove the "write back" section of the documentation if you think it doesn't provide much value.
   
   My primary goal was mostly to say "if you need to open a big IPC format file, open it using memory mapping or you will just face an OOM". The write-back section was meant to reinforce that concept but doesn't really add additional value. I'm mostly interested in shipping an addition to the docs that documents that concept somewhere, and I don't want to make perfect the enemy of good enough, so I'm OK with deferring to other PRs any part that doesn't get obvious consensus.
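
A minimal sketch of that "open big IPC files memory-mapped" advice, assuming `bigfile.arrow` was written as in the quoted example: opening the file through a memory map lets you inspect it and pull out individual record batches without materializing the whole file, and only the pages actually touched get faulted in.

```python
# Illustrative sketch: open a (potentially huge) Arrow IPC file via a memory
# map and read a single record batch from it.
import pyarrow as pa

with pa.memory_map('bigfile.arrow', 'r') as source:
    reader = pa.ipc.open_file(source)
    print("batches in file:", reader.num_record_batches)
    # Only the pages backing this batch (plus file metadata) are paged in.
    first = reader.get_batch(0)
    print("rows in first batch:", first.num_rows)
    print("Arrow allocations: {} MB".format(pa.total_allocated_bytes() >> 20))
```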







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r673995490



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the memory mapped data without the need to make copies of it

Review comment:
       I got rid of the "write back" section; in the end it wasn't providing much practical benefit for the reader, as the cases where you write back unmodified data are very uncommon.







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657126453



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       In this particular case, we're creating and serializing a large amount of data, so it really does seem to add some cost to building the docs.







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657804510



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.

Review comment:
       I don't know which users we're targeting here, but "page in" and "page out" are not commonly understood terms, I think. Of course, we can't start explaining in detail how memory works here, but this section will typically be read by people who might not fully understand what memory mapping is or how it works, and who just think they can use it to avoid memory allocation.
   

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       I agree it would be nice to show a more "interesting" example where memory-mapping can be useful.
   
   It's also not really clear to me (as a memory-mapping noob) what exactly the above means if you do some operation on the memory-mapped data (something other than just writing it back unmodified).
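
As one possible "interesting" example along the lines discussed here, a hedged sketch (assuming `bigfile.arrow` from the quoted example and `pyarrow.compute`): an aggregation scans the memory-mapped buffers but only allocates its small result, while any transformed or derived data would be a normal in-memory allocation.

```python
# Illustrative sketch: run a computation over memory-mapped data.
import pyarrow as pa
import pyarrow.compute as pc

with pa.memory_map('bigfile.arrow', 'r') as source:
    table = pa.ipc.open_file(source).read_all()
    # The sum scans the mapped buffers; only the scalar result is allocated.
    total = pc.sum(table['nums'])
    print("sum:", total.as_py())
    print("Arrow allocations: {} MB".format(pa.total_allocated_bytes() >> 20))
```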

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to

Review comment:
       ```suggestion
   Arrow can directly reference the data mapped from disk and avoid having to
   ```
   
   (or "pyarrow")







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668602824



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       There are benefits, because if you are only reading the data (say, to compute means or other aggregates on it), and you are using memory mapping while having to read more data than fits in memory, the kernel can evict the pages no longer in use without paying the cost of writing them to swap: they are already available in the memory-mapped file and can be paged back in directly from the mapping.
   
   On the other hand, if you were relying on swap and had read the file normally, then when the data doesn't fit into memory the kernel has to incur the cost of writing it to swap; otherwise it couldn't page it out, since (as far as the memory manager is concerned) there would be no copy from which to page it back in.
   
   So memory mapping avoids the cost of _writing to the swap file_ when you are running out of memory.
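
A hedged sketch of the scenario described above (assuming `bigfile.arrow` from the quoted example): scanning a memory-mapped IPC file batch by batch to compute an aggregate. At any point only the recently touched pages need to be resident, and since they are clean the kernel can evict them without writing to swap, so the same scan also works for files larger than RAM.

```python
# Illustrative sketch: stream over the record batches of a memory-mapped file
# to compute a mean without ever materializing the whole dataset.
import pyarrow as pa
import pyarrow.compute as pc

running_sum = 0
row_count = 0
with pa.memory_map('bigfile.arrow', 'r') as source:
    reader = pa.ipc.open_file(source)
    for i in range(reader.num_record_batches):
        batch = reader.get_batch(i)             # references mapped pages
        running_sum += pc.sum(batch.column(0)).as_py()
        row_count += batch.num_rows

print("mean:", running_sum / row_count)
print("Arrow allocations: {} MB".format(pa.total_allocated_bytes() >> 20))
```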







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668607645



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       Reduced the size of the example to 1/10th







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r673999027



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to

Review comment:
       :+1: 







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r630837422



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       I know we've started using those `ipython` blocks, but I'm really not fond of them. They seem to make building the docs noticeably slower (especially if the workload is non-trivial).
   
   @jorisvandenbossche What do you think?

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       This is rather misleading. The data is loaded back into memory when it is read; it's just that it's read lazily, so the cost is not paid up front (and no cost is paid for data that is never accessed).
   
   What memory mapping can avoid is an intermediate copy when reading the data, so it is more performant in that sense.
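
A rough sketch of that "intermediate copy" point (the file `example.dat` here is created just for the illustration, and the behaviour of the allocation counter is indicative rather than exact): reading a buffer from a plain file copies the bytes into newly allocated memory, while reading it from a memory map returns a buffer that references the mapping.

```python
# Illustrative sketch: buffer reads from a regular file vs. a memory map.
import pyarrow as pa

with open('example.dat', 'wb') as f:
    f.write(b'some example data')

with pa.OSFile('example.dat', 'rb') as f:
    copied = f.read_buffer()       # bytes copied into a newly allocated buffer
print("after OSFile read_buffer:    ", pa.total_allocated_bytes())

with pa.memory_map('example.dat', 'r') as f:
    mapped = f.read_buffer()       # buffer references the mapping, no copy
    print("after memory_map read_buffer:", pa.total_allocated_bytes())
```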

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the memory mapped data without the need to make copies of it

Review comment:
       This seems a bit misleading again. First, I don't understand the point of writing back data that's read from a memory-mapped file (just copy the file if that's what you want to do?). Second, the fact that writing data doesn't consume additional memory has nothing to do with the fact that the data is memory-mapped, AFAICT.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format

Review comment:
       Always "Arrow" capitalized.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       It seems this should go into `ipc.rst`. `memory.rst` is about the low-level memory and IO APIs.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r631870514



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       I see what you mean. What I was trying to say is that Arrow doesn't have to allocate memory itself, since it can point directly to the memory-mapped buffer, whose allocation is managed by the system. Also, the memory-mapped buffer can be paged out more easily by the system, without any write-back cost, because it isn't flagged as dirty memory; that makes it possible to deal with files bigger than memory even in the absence of a swap file. I'll try to rephrase this in a less misleading way.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r668599462



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       moved to `ipc.rst` as suggested







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r631872202



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the memory mapped data without the need to make copies of it

Review comment:
       The point was to show that `total_allocated_bytes` remains zero even though we had to iterate over all the data to write it back, since we still reference the memory-mapped buffer without having to make any copy of it.
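
   As a rough sketch of that point (assuming the `bigfile.arrow` file from the example above exists), writing the memory-mapped table back out batch by batch leaves `pa.total_allocated_bytes()` essentially unchanged, because every chunk handed to the writer is only a view over the mapping:

   ```python
   import pyarrow as pa

   # Same one-column schema as in the example above.
   schema = pa.schema([pa.field('nums', pa.int32())])

   # Read the table back through a memory map: zero-copy references only.
   with pa.memory_map('bigfile.arrow', 'rb') as source:
       loaded_array = pa.ipc.open_file(source).read_all()

   # Write it back out; 'bigfile2.arrow' is just an example path.
   with pa.OSFile('bigfile2.arrow', 'wb') as sink:
       with pa.ipc.new_file(sink, schema) as writer:
           # iterchunks() yields zero-copy views over the memory-mapped data.
           for chunk in loaded_array.column(0).iterchunks():
               writer.write(pa.record_batch([chunk], schema))

   # Still (close to) zero: nothing was copied into Arrow's own allocator.
   print("allocated MB:", pa.total_allocated_bytes() >> 20)
   ```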







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674824095



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python
+
+      BATCH_SIZE = 10000
+      NUM_BATCHES = 1000
+
+      schema = pa.schema([pa.field('nums', pa.int32())])
+
+      with pa.OSFile('bigfile.arrow', 'wb') as sink:
+         with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                  batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                  writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+      with pa.OSFile('bigfile.arrow', 'rb') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+      with pa.memory_map('bigfile.arrow', 'r') as source:

Review comment:
       :+1:







[GitHub] [arrow] github-actions[bot] commented on pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#issuecomment-834482409


   https://issues.apache.org/jira/browse/ARROW-12650





[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657791375



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       Personally I would prefer to have *some* way to still verify the example, but this doesn't need to be with the IPython directive (which actually only ensures the code runs without error, not that the output is correct). This has come up before as well, so I opened a separate JIRA to discuss this in general: https://issues.apache.org/jira/browse/ARROW-13159







[GitHub] [arrow] github-actions[bot] commented on pull request #10266: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#issuecomment-834478225


   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   





[GitHub] [arrow] westonpace commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r658187770



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       If you want an example with a clear performance benefit then a partial IPC read is a pretty good one.  The IPC reader today does not push column selection into I/O filtering.  In other words, even if you read only a few columns it will still "read" the entire file.  Since it doesn't access the memory for the undesired columns you can see a benefit in memory mapping.
   
   One could conceivably implement a smarter IPC reader that does selective read I/O (with prebuffering) and probably closes the gap somewhat, although the overhead of figuring out and issuing all the small reads may still leave an advantage to the memory-mapped method.
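
   As a rough sketch of that scenario (hypothetical file name; assumes a wide, multi-column IPC file written beforehand): with a memory map, reading the whole file and then touching only one column means the pages backing the other columns are never actually faulted in, even though the reader nominally "reads" all of them.

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # 'wide_file.arrow' is a hypothetical IPC file with many columns.
   with pa.memory_map('wide_file.arrow', 'rb') as source:
       table = pa.ipc.open_file(source).read_all()

   # Only the buffers behind column 'a' are paged in by this computation;
   # the other columns remain untouched regions of the mapping.
   print(pc.sum(table.column('a')))
   ```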







[GitHub] [arrow] westonpace commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r658185713



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the memory mapped data without the need to make copies of it

Review comment:
       If you use the datasets API to scan a dataset (that isn't memory mapped) and then write it back out (repartitioned, filtered, etc.) it shouldn't require more than a couple hundred MB of working RAM (controllable by readahead) regardless of dataset size.
   
   There will be one additional copy (incurred at read time to copy from kernel space to user space) compared to memory mapped files but other than that the two will be similar.  In practice this additional copy happens in parallel with the rest of the work and doesn't make a noticeable runtime difference.
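
   For reference, a minimal sketch of that kind of streaming rewrite with the datasets API (hypothetical directory names; assumes an existing directory of Arrow IPC files):

   ```python
   import pyarrow.dataset as ds

   # Scans the source dataset batch by batch and writes it back out, so
   # working memory stays bounded by the readahead window rather than by
   # the total dataset size.
   src = ds.dataset('source_dir', format='ipc')
   ds.write_dataset(src, 'repartitioned_dir', format='ipc')
   ```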







[GitHub] [arrow] amol- edited a comment on pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- edited a comment on pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#issuecomment-884216675


   @jorisvandenbossche @pitrou I think I did my best to address the remaining comments, could you take another look? :D





[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r673998283



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       As mentioned in https://github.com/apache/arrow/pull/10266#discussion_r673995490, I got rid of the "write back" section, which seemed to just open a can of worms without providing much practical value for the reader as it was.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r636905643



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       I don't have a strong opinion about using `ipython` blocks or just code blocks. I used them just for consistency with the rest of the document.
   
   I think that they have value in at least verifying that the code you provided as an example can actually execute (even though it might lead to different results), which makes it easier to catch examples that become invalid due to API changes.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r675429058



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       :+1:







[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657797354



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python

Review comment:
       For this specific example, we could also make it smaller? (which is still OK for educational purposes I think?)







[GitHub] [arrow] westonpace commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r658190554



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the array can directly reference the data on disk and avoid copying it to memory.
+In such case the memory consumption is greatly reduced and it's possible to read
+arrays bigger than the total memory

Review comment:
       Would memory mapping be more efficient than a system with swap enabled?  You mention that there are potential write-back savings, but why would the page be flagged as dirty in a swap scenario?  In either case it seems we are talking about read-only access to a larger-than-physical-memory file.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r631867852



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       I'm open to suggestions about where to best put this. I picked `memory.rst` because the title is "Memory and IO Interfaces" and it covers input streams, memory mapped files and reading memory mapped data.

##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,95 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw arrow data, we can use the Arrow File Format

Review comment:
       :+1: will change this.







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674778637



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python
+
+      BATCH_SIZE = 10000
+      NUM_BATCHES = 1000
+
+      schema = pa.schema([pa.field('nums', pa.int32())])
+
+      with pa.OSFile('bigfile.arrow', 'wb') as sink:
+         with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                  batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                  writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+      with pa.OSFile('bigfile.arrow', 'rb') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+      with pa.memory_map('bigfile.arrow', 'r') as source:

Review comment:
       Can we be consistent wrt `'r'` vs. `'rb'` above?

##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python

Review comment:
       I'm still lukewarm about using `ipython` blocks here.

##########
File path: docs/source/python/parquet.rst
##########
@@ -112,6 +112,29 @@ In general, a Python file object will have the worst read performance, while a
 string file path or an instance of :class:`~.NativeFile` (especially memory
 maps) will perform the best.
 
+.. _parquet_mmap:
+
+Reading Parquet and Memory Mapping
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because Parquet data needs to be decoded from the parquet format 

Review comment:
       Use "Parquet" (titlecased) consistently?

##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python
+
+      BATCH_SIZE = 10000
+      NUM_BATCHES = 1000
+
+      schema = pa.schema([pa.field('nums', pa.int32())])
+
+      with pa.OSFile('bigfile.arrow', 'wb') as sink:
+         with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                  batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                  writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+      with pa.OSFile('bigfile.arrow', 'rb') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+      with pa.memory_map('bigfile.arrow', 'r') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+.. note::
+
+   Most high level APIs like :meth:`~pyarrow.parquet.read_table` also provide a

Review comment:
       I would say "Other high-level APIs" rather than "most".

##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       Say "record batches" instead of "arrays"?







[GitHub] [arrow] westonpace commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r658188518



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       I did some experiments here https://github.com/apache/arrow/issues/10138 which demonstrate the effect.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674832346



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python

Review comment:
       I'm not fond of the `ipython` directive either, but we have a dedicated Jira issue ( https://issues.apache.org/jira/browse/ARROW-13159 ); for now I adhered to what seemed to be the practice in the rest of that file.







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657128479



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile2.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for chunk in loaded_array[0].iterchunks():
+                batch = pa.record_batch([chunk], schema)
+                writer.write(batch)
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Most high level APIs like :meth:`~pyarrow.parquet.read_table` also provide a
+``memory_map`` option. But in those cases, the memory mapping can't help with
+reducing resident memory consumption. Because Parquet data needs to be decoded
+from the parquet format and compression, it can't be directly mapped from disk,
+thus the ``memory_map`` option might perform better on some systems but won't
+help much with resident memory consumption.

Review comment:
       IMHO, it would be better to put the discussion of Parquet options in the Parquet chapter.
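
   For reference, the `memory_map` option mentioned in the note above is used like this (hypothetical file name); the flag only changes how the raw bytes are read, while the decoded and decompressed columns are still allocated in Arrow's own memory:

   ```python
   import pyarrow.parquet as pq

   # memory_map=True maps the Parquet file for reading, but the data still
   # has to be decoded/decompressed into newly allocated Arrow memory.
   table = pq.read_table('example.parquet', memory_map=True)
   ```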







[GitHub] [arrow] westonpace commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
westonpace commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r658187770



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+the arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such case the operating system will be able to page in the mapped memory
+lazily and page it out without any write back cost when under pressure,
+allowing to more easily read arrays bigger than the total memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally we can write back to disk the loaded array without consuming any
+extra memory thanks to the fact that iterating over the array will just
+scan through the data without the need to make copies of it

Review comment:
       If you want an example with a clear performance benefit then a partial IPC read is a pretty good one.  The IPC reader today does not push column selection into I/O filtering.  In other words, even if you read only a few columns it will still "read" the entire file.  Since it doesn't access the memory for the undesired columns you can see a benefit in memory mapping.
   
   One could conceivably implement a smarter IPC reader that does selective read I/O (with prebuffering) and probably closes the gap somewhat, although the overhead of figuring out and issuing all the small reads may still leave an advantage to the memory-mapped method.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674824836



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero copy and memory mapped data, Arrow allows to easily
+read and write arrays consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to file, you can use the :meth:`~pyarrow.ipc.new_file`
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example to write an array of 10M integers, we could write it in 1000 chunks
+of 10000 entries:
+
+.. ipython:: python
+
+      BATCH_SIZE = 10000
+      NUM_BATCHES = 1000
+
+      schema = pa.schema([pa.field('nums', pa.int32())])
+
+      with pa.OSFile('bigfile.arrow', 'wb') as sink:
+         with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                  batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                  writer.write(batch)
+
+record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because we in theory need to keep in memory only
+the current batch we are writing. But when reading back, we can be even more effective
+by directly mapping the data from disk and avoid allocating any new memory on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+      with pa.OSFile('bigfile.arrow', 'rb') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such a case the operating system will be able to page in the mapped memory
+lazily and page it out without any write-back cost when under pressure,
+making it easier to read arrays bigger than the total available memory.
+
+.. ipython:: python
+
+      with pa.memory_map('bigfile.arrow', 'r') as source:
+         loaded_array = pa.ipc.open_file(source).read_all()
+      print("LEN:", len(loaded_array))
+      print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+.. note::
+
+   Most high level APIs like :meth:`~pyarrow.parquet.read_table` also provide a

Review comment:
       :+1:
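   As a rough illustration of that kind of high-level option (the Parquet file name below is made up; `memory_map` is a keyword of `pyarrow.parquet.read_table` that only takes effect when reading from a local file path, and because Parquet data still has to be decoded, the resulting table is allocated in memory as usual):

   ```python
   import pyarrow.parquet as pq

   # Hypothetical local Parquet file; memory_map=True memory-maps the raw
   # file bytes instead of buffering them through regular reads.
   table = pq.read_table('example.parquet', memory_map=True)
   print(table.num_rows)
   ```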







[GitHub] [arrow] pitrou commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r657127966



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero-copy and memory-mapped data, Arrow allows you to easily
+read and write arrays while consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to a file, you can use :meth:`~pyarrow.ipc.new_file`,
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example, to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+Record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because in theory we only need to keep in memory
+the current batch that we are writing. But when reading back, we can be even more
+effective by directly mapping the data from disk and avoiding any new memory
+allocation on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such a case the operating system will be able to page in the mapped memory
+lazily and page it out without any write-back cost when under pressure,
+making it easier to read arrays bigger than the total available memory.
+
+.. ipython:: python
+
+    with pa.memory_map('bigfile.arrow', 'r') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+Equally, we can write the loaded array back to disk without consuming any
+extra memory, since iterating over the array will just scan through the
+data without the need to make copies of it.

Review comment:
       While this is true, I'm not sure the use case of writing back identical memory-mapped data is very interesting to talk about. Usually, you would write back data after some amount of processing (e.g. filtering).
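   For example, a sketch of that more typical round trip, reusing the `bigfile.arrow` / `nums` example from the diff above (assuming a pyarrow version that provides `Table.filter` and `pyarrow.compute`):

   ```python
   import pyarrow as pa
   import pyarrow.compute as pc

   # Map the previously written file, filter it, and write the smaller result back.
   with pa.memory_map('bigfile.arrow', 'r') as source:
       table = pa.ipc.open_file(source).read_all()

   # Keep only the rows whose 'nums' value is below 1000.
   filtered = table.filter(pc.less(table['nums'], 1000))

   with pa.OSFile('filtered.arrow', 'wb') as sink:
       with pa.ipc.new_file(sink, filtered.schema) as writer:
           writer.write(filtered)
   ```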







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674823705



##########
File path: docs/source/python/ipc.rst
##########
@@ -154,6 +154,73 @@ DataFrame output:
    df = pa.ipc.open_file(buf).read_pandas()
    df[:5]
 
+Efficiently Writing and Reading Arrow Arrays

Review comment:
       Maybe `Efficiently Writing and Reading Data` to keep it more general?
   I mean, when reading in those examples we don't really use record batches, and when writing, the record batches seem to be the means by which you achieve the efficiency (chunking) rather than the _goal_ itself.







[GitHub] [arrow] amol- commented on a change in pull request #10266: ARROW-12650: [Doc][Python] Improve documentation regarding dealing with memory mapped files

Posted by GitBox <gi...@apache.org>.
amol- commented on a change in pull request #10266:
URL: https://github.com/apache/arrow/pull/10266#discussion_r674003011



##########
File path: docs/source/python/memory.rst
##########
@@ -277,6 +277,97 @@ types than with normal Python file objects.
    !rm example.dat
    !rm example2.dat
 
+Efficiently Writing and Reading Arrow Arrays
+--------------------------------------------
+
+Being optimized for zero-copy and memory-mapped data, Arrow allows you to easily
+read and write arrays while consuming the minimum amount of resident memory.
+
+When writing and reading raw Arrow data, we can use the Arrow File Format
+or the Arrow Streaming Format.
+
+To dump an array to a file, you can use :meth:`~pyarrow.ipc.new_file`,
+which will provide a new :class:`~pyarrow.ipc.RecordBatchFileWriter` instance
+that can be used to write batches of data to that file.
+
+For example, to write an array of 100M integers, we could write it in 1000 chunks
+of 100000 entries:
+
+.. ipython:: python
+
+    BATCH_SIZE = 100000
+    NUM_BATCHES = 1000
+
+    schema = pa.schema([pa.field('nums', pa.int32())])
+
+    with pa.OSFile('bigfile.arrow', 'wb') as sink:
+        with pa.ipc.new_file(sink, schema) as writer:
+            for row in range(NUM_BATCHES):
+                batch = pa.record_batch([pa.array(range(BATCH_SIZE), type=pa.int32())], schema)
+                writer.write(batch)
+
+Record batches support multiple columns, so in practice we always write the
+equivalent of a :class:`~pyarrow.Table`.
+
+Writing in batches is effective because in theory we only need to keep in memory
+the current batch that we are writing. But when reading back, we can be even more
+effective by directly mapping the data from disk and avoiding any new memory
+allocation on read.
+
+Under normal conditions, reading back our file will consume a few hundred megabytes
+of memory:
+
+.. ipython:: python
+
+    with pa.OSFile('bigfile.arrow', 'rb') as source:
+        loaded_array = pa.ipc.open_file(source).read_all()
+
+    print("LEN:", len(loaded_array))
+    print("RSS: {}MB".format(pa.total_allocated_bytes() >> 20))
+
+To more efficiently read big data from disk, we can memory map the file, so that
+Arrow can directly reference the data mapped from disk and avoid having to
+allocate its own memory.
+In such a case the operating system will be able to page in the mapped memory
+lazily and page it out without any write-back cost when under pressure,
+making it easier to read arrays bigger than the total available memory.

Review comment:
       I rephrased it this way mostly because there were some concerns in previous comments about the "avoid memory allocation" wording: the memory is getting allocated anyway, but it can be swapped out at any time without any additional write-back cost, so you can avoid OOMs even if you have exhausted memory or swap space.
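   A quick way to observe the distinction (psutil is an extra dependency assumed here purely for illustration): `pa.total_allocated_bytes()` tracks Arrow's own allocator, while the process RSS also counts the mapped pages the OS has faulted in, which it can drop again under pressure without writing them back to swap.

   ```python
   import psutil
   import pyarrow as pa

   process = psutil.Process()

   with pa.memory_map('bigfile.arrow', 'r') as source:
       loaded_array = pa.ipc.open_file(source).read_all()

   # Arrow's allocator reports almost nothing, since the buffers point into the map...
   print("Arrow allocated: {} MB".format(pa.total_allocated_bytes() >> 20))
   # ...while RSS only grows as mapped pages are actually touched, and the OS
   # can reclaim those pages later without writing them to swap.
   print("Process RSS: {} MB".format(process.memory_info().rss >> 20))
   ```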



