Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/07 08:19:28 UTC

[GitHub] [arrow] stefan-lange-dataeng opened a new issue #8607: Deletion of existing file when write_table fails

stefan-lange-dataeng opened a new issue #8607:
URL: https://github.com/apache/arrow/issues/8607


   https://github.com/apache/arrow/blob/47f2e0cb03ed8ad265e0688ada8162bf46066483/python/pyarrow/parquet.py#L1737
   
   When write_table encounters a problem, the exception handler removes the attempted output parquet file (see snippet below).
   This logic makes sense: it ensures that no file with inconsistent content or state remains.
   However, if a file with the same name already exists, it is deleted as well.
   
   Would it make sense to add an option that lets the user choose the behaviour in such a case, e.g. to keep an existing file and overwrite it only if the write succeeds?
   And/or: would it make sense to check early whether the intended file can be written, and to fail early (without deleting a preexisting file) if it cannot?
   For example, if the directory has permission 755 and the already existing file has permission 444, the write attempt fails with a PermissionError, but the exception handler still deletes the preexisting file. This behaviour seems counterintuitive.
   Or would you say the responsibility lies with whoever sets the file/directory permissions?
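   The fail-early idea could look roughly like the following sketch (a hypothetical helper, not existing pyarrow API; note that `os.access` is advisory only and race-prone, so callers would still have to handle write errors):

   ```python
   import os

   def can_write(path: str) -> bool:
       """Best-effort pre-check: True if writing `path` should succeed.

       Hypothetical helper, not pyarrow API. An existing target must itself
       be writable; otherwise the containing directory must allow creating
       a new entry. `os.access` is advisory (races, ACLs), so the actual
       write can still fail and must be handled.
       """
       if os.path.exists(path):
           return os.access(path, os.W_OK)
       parent = os.path.dirname(os.path.abspath(path))
       return os.access(parent, os.W_OK | os.X_OK)
   ```

   With such a check, write_table could refuse up front instead of deleting the preexisting file after the fact.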
   
   except Exception:
       if _is_path_like(where):
           try:
               os.remove(_stringify_path(where))
           except os.error:
               pass
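
   The "overwrite only on success" behaviour could be implemented with the classic write-to-temp-then-rename pattern. A minimal sketch (the `safe_write` helper is hypothetical, not pyarrow code):

   ```python
   import os
   import tempfile

   def safe_write(path: str, data: bytes) -> None:
       """Write `data` to `path` without destroying an existing file on failure.

       Hypothetical helper, not pyarrow API: the bytes go to a temporary file
       in the same directory, which is atomically renamed over the target
       only after the write has fully succeeded. If anything fails, the
       temp file is discarded and a preexisting target is left untouched.
       """
       parent = os.path.dirname(os.path.abspath(path))
       fd, tmp = tempfile.mkstemp(dir=parent)
       try:
           with os.fdopen(fd, "wb") as f:
               f.write(data)
           os.replace(tmp, path)  # atomic on POSIX, overwrites the target
       except BaseException:
           os.remove(tmp)  # discard the partial temp file
           raise
   ```

   Because the rename depends on directory permissions rather than the target file's mode, this would also handle the 755-directory/444-file case described above: the old file survives a failed write and is replaced atomically on success.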


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] stefan-lange-dataeng commented on issue #8607: Deletion of existing file when write_table fails

Posted by GitBox <gi...@apache.org>.
stefan-lange-dataeng commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-727846447


   Thanks, I have created https://issues.apache.org/jira/browse/ARROW-10611.





[GitHub] [arrow] wesm commented on issue #8607: Deletion of existing file when write_table fails

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-727649809


   Can you please open some Jira issues if there's something to fix or improve in pyarrow? 





[GitHub] [arrow] wesm closed issue #8607: Deletion of existing file when write_table fails

Posted by GitBox <gi...@apache.org>.
wesm closed issue #8607:
URL: https://github.com/apache/arrow/issues/8607


   





[GitHub] [arrow] chr1st1ank commented on issue #8607: Deletion of existing file when write_table fails

Posted by GitBox <gi...@apache.org>.
chr1st1ank commented on issue #8607:
URL: https://github.com/apache/arrow/issues/8607#issuecomment-726609910


   This can be reproduced with the following commands in IPython.
   In effect, an attempt to write to a file without write permission on the file itself results in the deletion of that file (provided, of course, that the user has sufficient permissions on the directory to delete it).
   
   ```
   >> import pandas as pd
   >> 
   >> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
   >> df.to_parquet('df.parquet.gzip', compression='gzip')
   >> pd.read_parquet('df.parquet.gzip')
   >> !ls -l 'df.parquet.gzip'
   
   -rw-r--r-- 1 myuser domain users 1529 Nov 13 09:31 df.parquet.gzip
   
   
   >> !chmod 000 'df.parquet.gzip'
   >> df.to_parquet('df.parquet.gzip', compression='gzip')
   
       ---------------------------------------------------------------------------
       
       PermissionError                           Traceback (most recent call last)
       
       <ipython-input-10-584c5c8752e0> in <module>
       ----> 1 df.to_parquet('df.parquet.gzip', compression='gzip')
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
          2220             index=index,
          2221             partition_cols=partition_cols,
       -> 2222             **kwargs
          2223         )
          2224 
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
           252         index=index,
           253         partition_cols=partition_cols,
       --> 254         **kwargs
           255     )
           256 
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
           115                 compression=compression,
           116                 coerce_timestamps=coerce_timestamps,
       --> 117                 **kwargs
           118             )
           119 
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, flavor, **kwargs)
          1122                 compression=compression,
          1123                 use_deprecated_int96_timestamps=use_int96,
       -> 1124                 **kwargs) as writer:
          1125             writer.write_table(table, row_group_size=row_group_size)
          1126     except Exception:
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/parquet.py in __init__(self, where, schema, flavor, version, use_dictionary, compression, use_deprecated_int96_timestamps, **options)
           338         if _is_path_like(where):
           339             fs = _get_fs_from_path(where)
       --> 340             sink = self.file_handle = fs.open(where, 'wb')
           341         else:
           342             sink = where
   
   
       ~/.conda/envs/pct-dev/lib/python3.6/site-packages/pyarrow/filesystem.py in open(self, path, mode)
           243         """
           244         path = _stringify_path(path)
       --> 245         return open(path, mode=mode)
           246 
           247     @property
   
   
       PermissionError: [Errno 13] Permission denied: 'df.parquet.gzip'
   
   
   
   >> !ls -l 'df.parquet.gzip'
       ls: cannot access df.parquet.gzip: No such file or directory
   ```
   

