Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/07 08:19:28 UTC
[GitHub] [arrow] stefan-lange-dataeng opened a new issue #8607: Deletion of existing file when write_table fails
stefan-lange-dataeng opened a new issue #8607:
URL: https://github.com/apache/arrow/issues/8607
https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737
When write_table encounters a problem, the exception handler removes the attempted output parquet file (see snippet below).
This logic makes sense in order to make sure no file with inconsistent content/state remains.
However, if a file with the same name already exists, it also gets deleted.
Would it make sense to add an option that lets the user choose the behaviour in such a case, e.g. to keep an existing file and only overwrite it if the write succeeds?
And/or: Would it make sense to check early if the intended file can be written and fail early if that is not the case (without deleting a preexisting file)?
E.g. if the directory has permission 755 and the already existing file has permission 444, then the write attempt fails with a PermissionError, but the exception handler still deletes the preexisting file. This behaviour is a bit counterintuitive.
Or would you say the responsibility lies with the people setting the file/directory permissions right?
    except Exception:
        if _is_path_like(where):
            try:
                os.remove(_stringify_path(where))
            except os.error:
                pass
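One way to get the "only overwrite on success" behaviour asked about above is to write to a temporary file in the same directory and swap it in with os.replace once the write has succeeded. The following is a minimal sketch under that assumption, not pyarrow's implementation; write_atomically and write_fn are hypothetical names introduced for illustration.

```python
import os
import tempfile


def write_atomically(path, write_fn):
    """Call write_fn(tmp_path) and move the result onto path only on success.

    write_atomically and write_fn are hypothetical names for this sketch;
    they are not part of pyarrow.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that os.replace stays
    # on one filesystem and can swap the file in a single step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        write_fn(tmp_path)
        os.replace(tmp_path, path)
    except Exception:
        # Clean up only the temporary file; a preexisting target is untouched.
        try:
            os.remove(tmp_path)
        except OSError:
            pass
        raise
```

With pyarrow, write_fn could be something like `lambda tmp: pq.write_table(table, tmp)`. Two caveats: on POSIX, os.replace only needs write permission on the directory, so it would still replace a file with permission 444; and the resulting file carries the temporary file's permissions (mkstemp creates it readable and writable by the owner only).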
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] stefan-lange-dataeng commented on issue #8607: Deletion of existing file when write_table fails
Posted by GitBox <gi...@apache.org>.
stefan-lange-dataeng commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-727846447
Thanks, I have created https://issues.apache.org/jira/browse/ARROW-10611.
[GitHub] [arrow] wesm commented on issue #8607: Deletion of existing file when write_table fails
Posted by GitBox <gi...@apache.org>.
wesm commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-727649809
Can you please open some Jira issues if there's something to fix or improve in pyarrow?
[GitHub] [arrow] wesm closed issue #8607: Deletion of existing file when write_table fails
Posted by GitBox <gi...@apache.org>.
wesm closed issue #8607:
URL: https://github.com/apache/arrow/issues/8607
[GitHub] [arrow] chr1st1ank commented on issue #8607: Deletion of existing file when write_table fails
Posted by GitBox <gi...@apache.org>.
chr1st1ank commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-726609910
This can be reproduced with the following commands in ipython.
In effect, attempting to write to a file without write permission on it results in the deletion of that file (of course only if the user has sufficient permissions on the directory to delete the file).
```
>> import pandas as pd
>>
>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>> df.to_parquet('df.parquet.gzip', compression='gzip')
>> pd.read_parquet('df.parquet.gzip')
>> !ls -l 'df.parquet.gzip'
-rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
>> !chmod 000 'df.parquet.gzip'
>> df.to_parquet('df.parquet.gzip', compression='gzip')
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
<ipython-input-10-584c5c8752e0> in <module>
----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
2220 index=index,
2221 partition_cols=partition_cols,
-> 2222 **kwargs
2223 )
2224
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
252 index=index,
253 partition_cols=partition_cols,
--> 254 **kwargs
255 )
256
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
115 compression=compression,
116 coerce_timestamps=coerce_timestamps,
--> 117 **kwargs
118 )
119
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
1122 compression=compression,
1123 use_deprecated_int96_timestamps=use_int96,
-> 1124 **kwargs) as writer:
1125 writer.write_table(table, row_group_size=row_group_size)
1126 except Exception:
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
338 if _is_path_like(where):
339 fs = _get_fs_from_path(where)
--> 340 sink = self.file_handle = fs.open(where, 'wb')
341 else:
342 sink = where
~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
243 """
244 path = _stringify_path(path)
--> 245 return open(path, mode=mode)
246
247 @property
PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
>> !ls -l 'df.parquet.gzip'
ls: cannot access df.parquet.gzip: No such file or directory
```
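The session above suggests a narrower alternative: let the exception handler remove the output file only if the file did not exist before the write was attempted. Below is a minimal sketch of that idea, assuming a plain path as the target; write_with_guarded_cleanup and write_fn are hypothetical names, not pyarrow functions.

```python
import os


def write_with_guarded_cleanup(path, write_fn):
    """Call write_fn(path); on failure, delete path only if we created it.

    Hypothetical helper for illustration, not part of pyarrow.
    """
    # Remember whether the target was already there before we touched it.
    existed_before = os.path.exists(path)
    try:
        write_fn(path)
    except Exception:
        if not existed_before:
            # The file is our own partial output, so removing it is safe.
            try:
                os.remove(path)
            except OSError:
                pass
        raise
```

In the permission-denied case shown above, the open call fails before any byte is written, so the preexisting file would stay intact. If the write fails midway after the file was opened, this sketch leaves a partially overwritten file behind, which is the trade-off against the current behaviour.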