Posted to dev@arrow.apache.org by Rares Vernica <rv...@gmail.com> on 2020/12/14 05:52:55 UTC

Python: Bad address when rewriting file

Hello,

As part of a test, I'm reading a record batch from an Arrow file,
re-batching the data in smaller batches, and writing back the result to the
same file. I'm getting an unexpected Bad address error and I wonder what I'm
doing wrong.

reader = pyarrow.open_stream(fn)
tbl = reader.read_all()

writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
batches = tbl.to_batches(max_chunksize=200)
writer.write_table(pyarrow.Table.from_batches(batches))
writer.close()

Traceback (most recent call last):
  File "tests/foo.py", line 10, in <module>
    writer.write_table(pyarrow.Table.from_batches(batches))
  File "pyarrow/ipc.pxi", line 237, in pyarrow.lib._CRecordBatchWriter.write_table
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad address

Do I need to "close" the reader or open the writer differently?

I'm using PyArrow 0.16.0 and Python 3.8.2.

Thank you!
Rares

Re: Python: Bad address when rewriting file

Posted by Antoine Pitrou <an...@python.org>.
Hi Rares,

Ok, so here is the explanation.  `pa.ipc.open_stream` will open the
given file memory-mapped, so the buffers read from the file are
zero-copy: they point directly into the mapped file.  But then you're
rewriting that same file from scratch, so those mapped buffers become
invalid memory.  Hence the "Bad address" error you're getting (the
underlying errno mnemonic for error code 14 is EFAULT).

If you need to rewrite the *same* file, you should disable memory
mapping.  For example, you can use
`pyarrow.ipc.open_stream(pyarrow.OSFile(fn))`, which will create a
regular file object.

Or you can arrange to not rewrite the same file.  For example you could
write to a temporary file, close it, and then move it to the original
location.

Regards

Antoine.



Re: Python: Bad address when rewriting file

Posted by Rares Vernica <rv...@gmail.com>.
Hi Antoine,

Here is a repro for this issue:

import pyarrow

fn = '/tmp/foo'

# Data
data = [
    pyarrow.array(range(1000)),
    pyarrow.array(range(1000))
]
batch = pyarrow.record_batch(data, names=['f0', 'f1'])

# File Prep
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, batch.schema)
writer.write_batch(batch)
writer.close()

# Read
reader = pyarrow.open_stream(fn)
tbl = reader.read_all()

# Rewrite
writer = pyarrow.ipc.RecordBatchStreamWriter(fn, tbl.schema)
batches = tbl.to_batches(max_chunksize=200)
writer.write_table(pyarrow.Table.from_batches(batches))
writer.close()


> python3 foo.py
Traceback (most recent call last):
  File "foo.py", line 24, in <module>
    writer.write_table(pyarrow.Table.from_batches(batches))
  File "pyarrow/ipc.pxi", line 237, in pyarrow.lib._CRecordBatchWriter.write_table
  File "pyarrow/error.pxi", line 97, in pyarrow.lib.check_status
OSError: [Errno 14] Error writing bytes to file. Detail: [errno 14] Bad address

Cheers,
Rares



Re: Python: Bad address when rewriting file

Posted by Antoine Pitrou <an...@python.org>.
Hello Rares,

Is there a complete reproducer that we may try out?

Regards

Antoine.

