Posted to dev@arrow.apache.org by "Ambalu, Robert" <Ro...@Point72.com> on 2018/05/09 14:31:13 UTC

Question about streaming to memorymapped files

Hey, I'm looking into streaming table updates into a memory-mapped file (C++).
I think I have everything I need (MemoryMappedFile output stream, RecordBatchStreamWriter), but I don't understand how to properly create the memory-mapped file. It looks like you have to preset the file's size when you create it, but since I'll be streaming I don't actually know how big a file I'm going to need.
Am I missing some other API entry point here? Is there a reason why the size is required up front and the memory map doesn't auto-grow as needed?

Thanks in advance
- Rob





DISCLAIMER: This e-mail message and any attachments are intended solely for the use of the individual or entity to which it is addressed and may contain information that is confidential or legally privileged. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, copying or other use of this message or its attachments is strictly prohibited. If you have received this message in error, please notify the sender immediately and permanently delete this message and any attachments.




Re: Question about streaming to memorymapped files

Posted by Wes McKinney <we...@gmail.com>.
hi Robert,

Thank you for this analysis. Having a memory map interface that
supports growing the memory map sounds useful, so we would welcome
this contribution to the project.

best
Wes


RE: Question about streaming to memorymapped files

Posted by "Ambalu, Robert" <Ro...@Point72.com>.
Antoine, fair point. I just ran some perf stats using FileOutputStream vs. my growing mmap implementation.
It seems that in most cases you are correct: their runtimes are basically equivalent. The only time mmap beats it significantly is when there are many Flush calls. I have a parameter that controls how many rows to buffer before finishing a record batch and writing it out. Note that my mmap implementation currently doubles its size every time it is asked to grow.

Writing 5 double columns over 10 million rows, I get the following:

MMAP:
BatchSize  Time
1          01:24.849
10         00:08.980
100        00:02.105
1000       00:01.081
10000      00:01.101

FILE:
BatchSize  Time
1          03:13.982
10         00:18.875
100        00:03.172
1000       00:01.137
10000      00:01.104


Re: Question about streaming to memorymapped files

Posted by Antoine Pitrou <an...@python.org>.
If you write your own auto-growing memory mapped file implementation,
I'd be curious about performance measurements vs. FileOutputStream (and
possibly BufferedOutputStream).

mremap() and truncate() calls are not free.  Also, at some point you'll
want to unmap data already written to prevent the map from growing
endlessly.
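
One way to keep the mapping from growing endlessly is to map only a fixed-size window of the file and slide it forward as data is written. The following is a minimal sketch of that idea using plain POSIX calls; WindowedMmapWriter and its methods are hypothetical names chosen for illustration and are not part of Arrow.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an Arrow class): an append-only writer that keeps
// only one fixed-size window of the file mapped at any time.
class WindowedMmapWriter {
 public:
  // window_bytes is rounded up to a multiple of the page size, because the
  // file offset passed to mmap() must be page-aligned.
  WindowedMmapWriter(const std::string& path, size_t window_bytes) {
    const size_t page = static_cast<size_t>(::sysconf(_SC_PAGESIZE));
    window_bytes_ = ((window_bytes + page - 1) / page) * page;
    if (window_bytes_ == 0) window_bytes_ = page;
    fd_ = ::open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd_ < 0) throw std::runtime_error("open failed");
    try {
      MapWindow();
    } catch (...) {
      ::close(fd_);
      throw;
    }
  }
  WindowedMmapWriter(const WindowedMmapWriter&) = delete;
  WindowedMmapWriter& operator=(const WindowedMmapWriter&) = delete;
  ~WindowedMmapWriter() { Close(); }

  void Write(const void* src, size_t n) {
    const char* p = static_cast<const char*>(src);
    while (n > 0) {
      if (used_ == window_bytes_) Advance();  // current window is full
      const size_t chunk = std::min(n, window_bytes_ - used_);
      std::memcpy(window_ + used_, p, chunk);
      used_ += chunk;
      p += chunk;
      n -= chunk;
    }
  }

  // Unmaps the window, trims the file to the bytes written, and closes it.
  void Close() {
    if (fd_ < 0) return;
    ::munmap(window_, window_bytes_);
    const int rc = ::ftruncate(fd_, static_cast<off_t>(offset_ + used_));
    (void)rc;  // best effort in this sketch
    ::close(fd_);
    fd_ = -1;
  }

  size_t bytes_written() const { return offset_ + used_; }
  size_t mapped_bytes() const { return window_bytes_; }

 private:
  // Extends the file to cover the window at offset_, then maps that window.
  void MapWindow() {
    if (::ftruncate(fd_, static_cast<off_t>(offset_ + window_bytes_)) != 0)
      throw std::runtime_error("ftruncate failed");
    void* p = ::mmap(nullptr, window_bytes_, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd_, static_cast<off_t>(offset_));
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
    window_ = static_cast<char*>(p);
    used_ = 0;
  }

  // Drops the full window (the kernel writes it back) and maps the next one.
  void Advance() {
    ::munmap(window_, window_bytes_);
    window_ = nullptr;
    offset_ += window_bytes_;
    MapWindow();
  }

  int fd_ = -1;
  char* window_ = nullptr;
  size_t window_bytes_ = 0;
  size_t offset_ = 0;  // file offset of the current window
  size_t used_ = 0;    // bytes written into the current window
};
```

The trade-off is one munmap/ftruncate/mmap round per window instead of one mremap per doubling, so the window size would need tuning against the same kind of measurements.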

Regards

Antoine.



RE: Question about streaming to memorymapped files

Posted by "Ambalu, Robert" <Ro...@Point72.com>.
I don’t use the output stream objects directly though, right? To take a step back a bit, what I'm trying to do is stream rows into a table in real time (with the ability to control how many rows to batch up before writing out a record batch).

My understanding is that to properly stream table data I need to:
a) create an output stream instance
b) create a RecordBatchStreamWriter, binding my stream object to it
c) create a RecordBatchBuilder. As rows arrive, add them to the record batch builder. When we're ready to flush, call Flush on the batch builder to create a record batch and pass the batch to the RecordBatchStreamWriter.

I was hoping to use MemoryMappedFile for (a), but since it doesn’t support dynamically growing the mmap file I'll have to write my own implementation.
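
The buffering part of step (c) can be sketched without any Arrow dependency. In the sketch below, Batch and RowBatcher are hypothetical stand-ins (not Arrow types): rows are appended column by column, and once rows_per_batch rows have accumulated the pending batch is handed to a sink callback, which plays the role that the stream writer plays in the steps above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: one vector of doubles per column.
struct Batch {
  std::vector<std::vector<double>> columns;
  size_t num_rows() const { return columns.empty() ? 0 : columns[0].size(); }
};

// Hypothetical helper: buffers rows and emits a batch every rows_per_batch rows.
class RowBatcher {
 public:
  using Sink = std::function<void(const Batch&)>;

  RowBatcher(size_t num_columns, size_t rows_per_batch, Sink sink)
      : rows_per_batch_(rows_per_batch), sink_(std::move(sink)) {
    if (num_columns == 0 || rows_per_batch == 0)
      throw std::invalid_argument("num_columns and rows_per_batch must be positive");
    pending_.columns.resize(num_columns);
  }

  // Appends one row; flushes automatically when the batch is full.
  void AppendRow(const std::vector<double>& row) {
    if (row.size() != pending_.columns.size())
      throw std::invalid_argument("row width does not match the column count");
    for (size_t i = 0; i < row.size(); ++i) pending_.columns[i].push_back(row[i]);
    if (pending_.num_rows() == rows_per_batch_) Flush();
  }

  // Hands any buffered rows to the sink and resets the buffers.
  void Flush() {
    if (pending_.num_rows() == 0) return;
    sink_(pending_);
    for (auto& column : pending_.columns) column.clear();
  }

 private:
  size_t rows_per_batch_;
  Sink sink_;
  Batch pending_;
};
```

A caller would construct the batcher with a sink that writes each batch to the output stream, and call Flush once more at shutdown so that a final partial batch is not lost.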


Re: Question about streaming to memorymapped files

Posted by Antoine Pitrou <an...@python.org>.
As for buffering data before making a call to write(): in Arrow 0.10.0
you'll be able to use BufferedOutputStream for this:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/buffered.h
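
To illustrate why buffering helps, here is a minimal sketch of the general technique on top of the POSIX write() call. It is not Arrow's BufferedOutputStream; BufferedFdWriter and write_calls() are hypothetical names, and the counter exists only to make the reduction in system calls visible.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper: collects small writes in memory and issues one write()
// system call per full buffer instead of one per Write() call.
class BufferedFdWriter {
 public:
  BufferedFdWriter(int fd, size_t buffer_bytes) : fd_(fd), buffer_(buffer_bytes) {}

  void Write(const void* src, size_t n) {
    if (n >= buffer_.size()) {  // too large to buffer: write it directly
      Flush();
      WriteAll(src, n);
      return;
    }
    if (used_ + n > buffer_.size()) Flush();
    std::memcpy(buffer_.data() + used_, src, n);
    used_ += n;
  }

  // Pushes any buffered bytes to the file descriptor.
  void Flush() {
    if (used_ == 0) return;
    WriteAll(buffer_.data(), used_);
    used_ = 0;
  }

  size_t write_calls() const { return write_calls_; }

 private:
  // Loops until all bytes are written, since write() may be partial.
  void WriteAll(const void* src, size_t n) {
    const char* p = static_cast<const char*>(src);
    while (n > 0) {
      const ssize_t written = ::write(fd_, p, n);
      if (written < 0) {
        if (errno == EINTR) continue;
        throw std::runtime_error("write failed");
      }
      ++write_calls_;
      p += written;
      n -= static_cast<size_t>(written);
    }
  }

  int fd_;
  std::vector<char> buffer_;
  size_t used_ = 0;
  size_t write_calls_ = 0;
};
```

With a 4096-byte buffer, a thousand 8-byte writes collapse into a handful of system calls; the caller still has to call Flush before closing the descriptor.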

Regards

Antoine.



RE: Question about streaming to memorymapped files

Posted by "Ambalu, Robert" <Ro...@Point72.com>.
I don’t have any offhand, no, but I would imagine that direct file writes at some point need to make a system call, which is expensive (fwrite might buffer before eventually making the system call; it looks like FileOutputStream issues a raw write system call for every write call).
The current mmap I/O interface unfortunately isn’t usable as a streaming output, though I suppose I could just implement my own.



Re: Question about streaming to memorymapped files

Posted by Antoine Pitrou <so...@pitrou.net>.
Do you know of any benchmark numbers / performance studies about this?
While it's true that a memory-mapped file avoids explicit system calls,
I've heard file I/O is quite well optimized, at least on Linux,
nowadays.

Regards

Antoine.




RE: Question about streaming to memorymapped files

Posted by "Ambalu, Robert" <Ro...@Point72.com>.
Antoine, thanks for the quick reply.
You can actually grow memory-mapped files with an mremap call (and, I think, a seek/write on the file to extend it); I do this in my applications and it works fine.
I want the efficiency of writing via memory maps, so I would prefer to avoid FileOutputStream.
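
A minimal, Linux-only sketch of that growth approach follows. It is an assumption about how such a writer could look, not the code from my applications and not an Arrow API: GrowingMmapWriter is a hypothetical name, the file is extended with ftruncate() rather than a seek/write, and the capacity doubles whenever a write does not fit.

```cpp
// Linux-only: mremap() is a Linux extension, not part of POSIX.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // exposes mremap() and MREMAP_MAYMOVE
#endif
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an Arrow class): an append-only writer whose
// backing file and mapping double in size whenever they run out of room.
class GrowingMmapWriter {
 public:
  GrowingMmapWriter(const std::string& path, size_t initial_capacity)
      : capacity_(initial_capacity) {
    if (capacity_ == 0) throw std::invalid_argument("capacity must be positive");
    fd_ = ::open(path.c_str(), O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd_ < 0) throw std::runtime_error("open failed");
    // The file must cover the whole mapping, or touching the tail raises SIGBUS.
    void* p = MAP_FAILED;
    if (::ftruncate(fd_, static_cast<off_t>(capacity_)) == 0) {
      p = ::mmap(nullptr, capacity_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
    }
    if (p == MAP_FAILED) {
      ::close(fd_);
      throw std::runtime_error("ftruncate or mmap failed");
    }
    data_ = static_cast<char*>(p);
  }
  GrowingMmapWriter(const GrowingMmapWriter&) = delete;
  GrowingMmapWriter& operator=(const GrowingMmapWriter&) = delete;
  ~GrowingMmapWriter() { Close(); }

  void Write(const void* src, size_t n) {
    if (size_ + n > capacity_) Grow(size_ + n);
    std::memcpy(data_ + size_, src, n);
    size_ += n;
  }

  // Unmaps, trims the file to the bytes actually written, and closes it.
  void Close() {
    if (fd_ < 0) return;
    ::munmap(data_, capacity_);
    const int rc = ::ftruncate(fd_, static_cast<off_t>(size_));
    (void)rc;  // best effort in this sketch
    ::close(fd_);
    fd_ = -1;
  }

  size_t size() const { return size_; }
  size_t capacity() const { return capacity_; }

 private:
  void Grow(size_t needed) {
    size_t new_capacity = capacity_;
    while (new_capacity < needed) new_capacity *= 2;
    if (::ftruncate(fd_, static_cast<off_t>(new_capacity)) != 0)
      throw std::runtime_error("ftruncate failed");
    // MREMAP_MAYMOVE lets the kernel relocate the mapping, so data_ may change.
    void* p = ::mremap(data_, capacity_, new_capacity, MREMAP_MAYMOVE);
    if (p == MAP_FAILED) throw std::runtime_error("mremap failed");
    data_ = static_cast<char*>(p);
    capacity_ = new_capacity;
  }

  int fd_ = -1;
  char* data_ = nullptr;
  size_t size_ = 0;
  size_t capacity_;
};
```

Because the mapping may move on growth, callers must not hold raw pointers into it across Write calls.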


Re: Question about streaming to memorymapped files

Posted by Antoine Pitrou <an...@python.org>.
Hi,

If you don't know the output size upfront then you should probably use a
FileOutputStream instead.  By definition, memory mapped files must have
a fixed size (since they are mapped to a fixed area in virtual memory).

Regards

Antoine.

