Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 16:04:40 UTC
[GitHub] [arrow] patrickpai opened a new pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
patrickpai opened a new pull request #7757:
URL: https://github.com/apache/arrow/pull/7757
Due to ongoing LZ4 problems with Parquet files, this patch disables writing files with LZ4 codec by throwing a `ParquetException`.
Mailing list discussion: https://mail-archives.apache.org/mod_mbox/arrow-dev/202007.mbox/%3CCAJPUwMCM4ZaJB720%2BuoM1aSA2oD9jSEnzuwWjJiw6vwXxHk7nw%40mail.gmail.com%3E
Jira ticket: https://issues.apache.org/jira/browse/ARROW-9424
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658393753
Ah, I see that you're adding Python changes. I fixed the lint problems here, so be sure to rebase your changes.
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438428
I don't recall but that may have been the case. Either way it's a giant mess since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658429763
@patrickpai do you expect to complete this today? We are hoping to cut a release candidate tomorrow during the Central European Time workday, so I can help finish this if needed.
[GitHub] [arrow] pitrou commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438215
In any case, the format used by Hadoop is neither of the two: it's LZ4_RAW with a custom header...
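To make "LZ4_RAW with a custom header" concrete, here is a minimal sketch of splitting such a buffer into frames. The header layout is an assumption that the thread does not spell out: each frame is taken to start with two big-endian 32-bit integers (decompressed size, then compressed size), followed by the raw LZ4 block. The function name `split_hadoop_lz4_frames` is hypothetical, and the sketch only parses the framing; decompressing the payload would need a third-party LZ4 library.

```python
import struct

# ASSUMPTION: each Hadoop-style frame starts with two big-endian uint32
# values: the decompressed size, then the compressed size.
HADOOP_LZ4_HEADER = struct.Struct(">II")


def split_hadoop_lz4_frames(buf):
    """Split a buffer into (decompressed_size, raw_lz4_block) tuples.

    Hypothetical helper: it only parses the assumed framing and leaves
    the raw LZ4 payloads untouched.
    """
    frames = []
    offset = 0
    while offset < len(buf):
        if len(buf) - offset < HADOOP_LZ4_HEADER.size:
            raise ValueError("truncated frame header")
        decompressed_size, compressed_size = HADOOP_LZ4_HEADER.unpack_from(
            buf, offset
        )
        offset += HADOOP_LZ4_HEADER.size
        end = offset + compressed_size
        if end > len(buf):
            raise ValueError("frame payload extends past end of buffer")
        frames.append((decompressed_size, bytes(buf[offset:end])))
        offset = end
    return frames
```

A reader that expects a bare LZ4 block or an LZ4 frame would trip over those leading header bytes, which is one way the incompatibility discussed here can surface.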
[GitHub] [arrow] github-actions[bot] commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658266555
https://issues.apache.org/jira/browse/ARROW-9424
[GitHub] [arrow] patrickpai commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
patrickpai commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658430846
@wesm I think it's best if I get help on this. I'm totally new to the Python codebase and wasn't expecting to finish today. I can let you take over.
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658468591
OK, writing is disabled, but old files can still be read:
```
In [2]: pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-2-597ef4749b0a> in <module>
----> 1 pq.write_table(table, 'not_allowed.parquet.lz4', compression='lz4')

~/code/arrow/python/pyarrow/parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, compression_level, use_byte_stream_split, data_page_version, **kwargs)
   1632                 data_page_version=data_page_version,
   1633                 **kwargs) as writer:
-> 1634             writer.write_table(table, row_group_size=row_group_size)
   1635     except Exception:
   1636         if _is_path_like(where):

~/code/arrow/python/pyarrow/parquet.py in write_table(self, table, row_group_size)
    586             raise ValueError(msg)
    587
--> 588         self.writer.write_table(table, row_group_size=row_group_size)
    589
    590     def close(self):

~/code/arrow/python/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetWriter.write_table()
   1406
   1407         with nogil:
-> 1408             check_status(self.writer.get()
   1409                          .WriteTable(deref(ctable), c_row_group_size))
   1410

~/code/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
     97                 raise IOError(errno, message)
     98             else:
---> 99                 raise IOError(message)
    100         elif status.IsOutOfMemory():
    101             raise ArrowMemoryError(message)

OSError: Per ARROW-9424, writing files with LZ4 compression has been disabled until implementation issues have been resolved. It is recommended to read any existing files and rewrite them using a different compression.
In ../src/parquet/arrow/writer.cc, line 684, code: WriteColumnChunk(table.column(i), offset, size)
In [3]: pq.read_table('example.parquet.lz4').to_pandas()
Out[3]:
   f0
0   1
1   2
2   3
3   4
4   5
```
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658437573
@pitrou @xhochy It seems that despite adding the LZ4_FRAME format we've been continuing to use LZ4_RAW for Parquet files. Unfortunate that this hasn't seen more compatibility testing.
[GitHub] [arrow] pitrou commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658437867
Wasn't it deliberate? IIRC we didn't want to break compatibility with existing files.
[GitHub] [arrow] wesm edited a comment on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm edited a comment on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658438428
I don't recall but that may have been the case. Either way it's a giant mess since many people use pyarrow to write Parquet files to be consumed by JVM-based systems. I think we can infer that LZ4 is not often used from the fact that we haven't had more bug reports about it.
Note that we can provide backward compatibility if needed for existing LZ4-compressed files by looking at the version number in the file footer
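One way the footer-based check could look is sketched here, under stated assumptions. The footer's `created_by` string is assumed to have the shape `<application> version <x.y.z>...` (for example `parquet-cpp version 1.5.1-SNAPSHOT`). The function names and the cutoff version are hypothetical; the thread does not say which release would mark the boundary.

```python
import re

# ASSUMPTION: created_by looks like "<application> version <x.y.z>[suffix]".
_CREATED_BY = re.compile(r"^(?P<app>\S+) version (?P<version>\d+(?:\.\d+)*)")


def parse_created_by(created_by):
    """Return (application, version_tuple), or None if the string is unparsable."""
    match = _CREATED_BY.match(created_by or "")
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("app"), version


def needs_legacy_lz4(created_by, cutoff=(1, 0, 0)):
    """Decide whether a file should be read with the old LZ4 handling.

    Hypothetical policy: files written by a parquet-cpp based writer older
    than `cutoff` are treated as legacy. The default cutoff is a placeholder.
    """
    parsed = parse_created_by(created_by)
    if parsed is None:
        return False
    app, version = parsed
    return app.startswith("parquet-cpp") and version < cutoff
```

With pyarrow, the string can be obtained from `pq.ParquetFile(path).metadata.created_by` and passed to `needs_legacy_lz4`.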
[GitHub] [arrow] wesm commented on pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm commented on pull request #7757:
URL: https://github.com/apache/arrow/pull/7757#issuecomment-658431456
No problem, I can take it from here.
[GitHub] [arrow] wesm closed pull request #7757: ARROW-9424: [C++][Parquet] Disable writing files with LZ4 codec
Posted by GitBox <gi...@apache.org>.
wesm closed pull request #7757:
URL: https://github.com/apache/arrow/pull/7757