Posted to dev@arrow.apache.org by Rares Vernica <rv...@gmail.com> on 2020/12/14 05:52:55 UTC
Python: Bad address when rewriting file
Hello,
As part of a test, I'm reading a record batch from an Arrow file,
re-batching the data into smaller batches, and writing the result back to
the same file. I'm getting an unexpected Bad address error and I wonder
what I'm doing wrong.
reader = pyarrow.open_stream(fn)
tbl = reader.read_all()
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
batches = tbl.to_batches(max_chunksize=200)
writer.write_table(pyarrow.Table.from_batches(batches))
writer.close()
Traceback (most recent call last):
File "tests/foo.py", line 10, in <module>
writer.write_table(pyarrow.Table.from_batches(batches))
File "pyarrow/ipc.pxi", line 237, in
pyarrow.lib._CRecordBatchWriter.write_table
File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
address
Do I need to "close" the reader or open the writer differently?
I'm using PyArrow 0.16.0 and Python 3.8.2.
Thank you!
Rares
Re: Python: Bad address when rewriting file
Posted by Antoine Pitrou <an...@python.org>.
Hi Rares,
Ok, so here is the explanation. `pa.ipc.open_stream` opens the given
file memory-mapped, so the buffers read from it are zero-copy views
into the file's contents. When you then rewrite the file from scratch,
those buffers point into invalid memory. Hence the "Bad address" error
you're getting (the errno mnemonic for error code 14 is EFAULT).
If you need to rewrite the *same* file, you should disable memory
mapping. For example, you can use
`pyarrow.ipc.open_stream(pyarrow.OSFile(fn))`, which will create a
regular file object.
Or you can arrange to not rewrite the same file. For example you could
write to a temporary file, close it, and then move it to the original
location.
Regards
Antoine.
On 14/12/2020 at 20:03, Rares Vernica wrote:
> Hi Antoine,
>
> Here is a repro for this issue:
>
> import pyarrow
>
> fn = '/tmp/foo'
>
> # Data
> data = [
> pyarrow.array(range(1000)),
> pyarrow.array(range(1000))
> ]
> batch = pyarrow.record_batch(data, names=['f0', 'f1'])
>
> # File Prep
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, batch.schema)
> writer.write_batch(batch)
> writer.close()
>
> # Read
> reader = pyarrow.open_stream(fn)
> tbl = reader.read_all()
>
> # Rewrite
> writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
> batches = tbl.to_batches(max_chunksize=200)
> writer.write_table(pyarrow.Table.from_batches(batches))
> writer.close()
>
>
>> python3 foo.py
> Traceback (most recent call last):
> File "foo.py", line 24, in <module>
> writer.write_table(pyarrow.Table.from_batches(batches))
> File "pyarrow/ipc.pxi", line 237, in
> pyarrow.lib._CRecordBatchWriter.write_table
> File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
> OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
> address
>
> Cheers,
> Rares
>
>
> On Mon, Dec 14, 2020 at 12:30 AM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Hello Rares,
>>
>> Is there a complete reproducer that we may try out?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 14/12/2020 at 06:52, Rares Vernica wrote:
>>> [...]
>>
>
Re: Python: Bad address when rewriting file
Posted by Rares Vernica <rv...@gmail.com>.
Hi Antoine,
Here is a repro for this issue:
import pyarrow
fn = '/tmp/foo'
# Data
data = [
pyarrow.array(range(1000)),
pyarrow.array(range(1000))
]
batch = pyarrow.record_batch(data, names=['f0', 'f1'])
# File Prep
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, batch.schema)
writer.write_batch(batch)
writer.close()
# Read
reader = pyarrow.open_stream(fn)
tbl = reader.read_all()
# Rewrite
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
batches = tbl.to_batches(max_chunksize=200)
writer.write_table(pyarrow.Table.from_batches(batches))
writer.close()
> python3 foo.py
Traceback (most recent call last):
File "foo.py", line 24, in <module>
writer.write_table(pyarrow.Table.from_batches(batches))
File "pyarrow/ipc.pxi", line 237, in
pyarrow.lib._CRecordBatchWriter.write_table
File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad
address
Cheers,
Rares
On Mon, Dec 14, 2020 at 12:30 AM Antoine Pitrou <an...@python.org> wrote:
>
> Hello Rares,
>
> Is there a complete reproducer that we may try out?
>
> Regards
>
> Antoine.
>
>
> On 14/12/2020 at 06:52, Rares Vernica wrote:
> > [...]
> >
>
Re: Python: Bad address when rewriting file
Posted by Antoine Pitrou <an...@python.org>.
Hello Rares,
Is there a complete reproducer that we may try out?
Regards
Antoine.
On 14/12/2020 at 06:52, Rares Vernica wrote:
> [...]