Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 16:04:40 UTC

[GitHub] [arrow] patrickpai opened a new pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

patrickpai opened a new pull request #7757:
URL: https://github.com/apache/arrow/pull/7757


   Due to ongoing LZ4 problems with Parquet files, this patch disables writing files with the LZ4 codec by throwing a `ParquetException`.
   
   Mailing list discussion: https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E
   
   Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658393753


   Ah, I see that you're adding Python changes. I fixed the lint problems here, so be sure to rebase your changes.





[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438428


   I don't recall, but that may have been the case. Either way, it's a giant mess, since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.





[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658429763


   @patrickpai do you expect to complete this today? We are hoping to cut a release candidate tomorrow during the workday, Central European Time, so I can help finish this if needed.





[GitHub] [arrow] pitrou commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438215


   In any case, the format used by Hadoop is neither of the two: it's LZ4_RAW with a custom header...
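
To make the distinction between the formats concrete, here is a minimal sketch of how one might tell them apart from the first bytes of a compressed buffer. It relies on two assumptions that are not stated in this thread: that an LZ4 frame starts with the magic number 0x184D2204 stored little-endian, and that the Hadoop header is two big-endian 32-bit integers (decompressed size, then compressed size) in front of a raw LZ4 block. The function names are hypothetical, and because a raw LZ4 block carries no magic number, the second check is only a heuristic.

```python
import struct

# Assumption: LZ4 frame magic number 0x184D2204, stored little-endian.
LZ4_FRAME_MAGIC = b"\x04\x22\x4d\x18"


def looks_like_lz4_frame(buf: bytes) -> bool:
    """Return True if the buffer starts with the LZ4 frame magic number."""
    return buf[:4] == LZ4_FRAME_MAGIC


def parse_hadoop_prefix(buf: bytes):
    """Heuristically read a Hadoop-style size prefix (hypothetical helper).

    Assumed layout: 4 bytes big-endian decompressed size, 4 bytes
    big-endian compressed size, then the raw LZ4 block. Returns
    (decompressed_size, compressed_size), or None if the buffer is too
    short or the declared compressed size cannot fit in the buffer.
    """
    if len(buf) < 8:
        return None
    decompressed_size, compressed_size = struct.unpack(">II", buf[:8])
    if compressed_size > len(buf) - 8:
        return None
    return decompressed_size, compressed_size
```

A reader that wants to stay compatible with several writers could try these checks in order and fall back to treating the buffer as a bare raw block when neither matches.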





[GitHub] [arrow] github-actions[bot] commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658266555


   https://issues.apache.org/jira/browse/ARROW-9424





[GitHub] [arrow] patrickpai commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
patrickpai commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658430846


   @wesm I think it's best if I get help on this. I'm totally new to the Python codebase and wasn't expecting to finish today. I can let you take over.





[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658468591


   OK, writing is disabled, but old files can still be read:
   
   ```
   In [2]: pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')
   ---------------------------------------------------------------------------
   OSError                                   Traceback (most recent call last)
   <ipython-input-2-597ef4749b0a> in <module>
   ----> 1 pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')
   
   ~/code/arrow/python/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, data_page_version, **kwargs)
      1632                 data_page_version=data_page_version,
      1633                 **kwargs) as writer:
   -> 1634             writer.write_table(table, row_group_size=row_group_size)
      1635     except Exception:
      1636         if _is_path_like(where):
   
   ~/code/arrow/python/pyarrow/parquet.py in write_table(self, table, row_group_size)
       586             raise ValueError(msg)
       587 
   --> 588         self.writer.write_table(table, row_group_size=row_group_size)
       589 
       590     def close(self):
   
   ~/code/arrow/python/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetWriter.write_table()
      1406 
      1407         with nogil:
   -> 1408             check_status(self.writer.get()
      1409                          .WriteTable(deref(ctable), c_row_group_size))
      1410 
   
   ~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
        97                 raise IOError(errno, message)
        98             else:
   ---> 99                 raise IOError(message)
       100         elif status.IsOutOfMemory():
       101             raise ArrowMemoryError(message)
   
   OSError: Per ARROW-9424, writing files with LZ4 compression has been disabled until implementation issues have been resolved. It is recommended to read any existing files and rewrite them using a different compression.
   In ../src/parquet/arrow/writer.cc, line 684, code: WriteColumnChunk(table.column(i), offset, size)
   
   In [3]: pq.read_table('example.parquet.lz4').to_pandas()                                                                                                                                       
   Out[3]: 
      f0
   0   1
   1   2
   2   3
   3   4
   4   5
   ```
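
The error message above recommends reading existing files and rewriting them with a different compression. A minimal sketch of that step follows, assuming pyarrow is installed. The helper name `rewrite_with_codec` and the default codec `'snappy'` are illustrative choices, not part of this patch.

```python
def rewrite_with_codec(src, dst, compression="snappy"):
    """Read an existing Parquet file and write it back with another codec.

    Hypothetical helper: the name and the default codec are illustrative.
    """
    # Refuse LZ4 up front rather than waiting for the writer to raise.
    if compression.lower() == "lz4":
        raise ValueError(
            "LZ4 writing is disabled per ARROW-9424; choose another codec"
        )
    # Imported lazily so the guard above works even without pyarrow.
    import pyarrow.parquet as pq

    table = pq.read_table(src)  # reading old LZ4 files still works
    pq.write_table(table, dst, compression=compression)
    return dst
```

For example, `rewrite_with_codec('example.parquet.lz4', 'example.parquet.snappy')` would convert the file read in the session above.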





[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658437573


   @pitrou @xhochy It seems that, despite adding the LZ4_FRAME format, we've been continuing to use LZ4_RAW for Parquet files. It's unfortunate that this hasn't seen more compatibility testing.





[GitHub] [arrow] pitrou commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658437867


   Wasn't it deliberate? IIRC we didn't want to break compatibility with existing files.





[GitHub] [arrow] wesm edited a comment on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438428


   I don't recall, but that may have been the case. Either way, it's a giant mess, since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.
   
   Note that we can provide backward compatibility for existing LZ4-compressed files, if needed, by looking at the version number in the file footer.
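
As a rough illustration of that idea, the sketch below parses the writer string from the file footer and compares its version against a cutoff. It assumes the footer's `created_by` field has the form `<application> version <x.y.z>...` (for example `parquet-cpp version 1.5.1-SNAPSHOT`). The cutoff value, the `parquet-cpp` prefix check, and both function names are hypothetical and would need to match whatever release actually changes the LZ4 behavior.

```python
import re

# Assumed shape of the footer string: "<application> version <x.y.z>..."
_CREATED_BY = re.compile(r"^(?P<app>\S+) version (?P<ver>\d+(?:\.\d+)*)")


def parse_created_by(created_by: str):
    """Split a created_by string into (application, version tuple)."""
    match = _CREATED_BY.match(created_by)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("ver").split("."))
    return match.group("app"), version


def written_by_older_cpp(created_by: str, cutoff) -> bool:
    """Return True if the file came from parquet-cpp older than `cutoff`.

    `cutoff` is a hypothetical version tuple supplied by the caller.
    """
    parsed = parse_created_by(created_by)
    if parsed is None:
        return False
    app, version = parsed
    return app.startswith("parquet-cpp") and version < cutoff
```

With pyarrow, the string could be obtained from `pq.ParquetFile(path).metadata.created_by` and passed to `written_by_older_cpp` to decide which LZ4 decoding path to use.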





[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658431456


   No problem, I can take it from here. 





[GitHub] [arrow] wesm closed pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec

Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7757:
URL: https://github.com/apache/arrow/pull/7757


   

