Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/06/12 07:56:00 UTC
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134030#comment-17134030 ]
Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------
[~shangx@uber.com], I don't know why the PR was not linked here automatically. Please add it manually so that the reference is available.
I don't get the sub-tasks. In the header of PR #796 you reference this jira, yet the PR already resolves the parquet-tools related sub-task. I think adding this functionality to {{parquet-cli}} is not a big enough job to be split out into another task. (I would suggest implementing the functionality in one place and invoking it from both {{parquet-tools}} and {{parquet-cli}}.)
What is the bloom filter support about? I am not sure about the bloom filters, but the offset indexes surely have to be updated because the page offsets will change. Without that the feature is incorrect, so I would not merge a PR to master that does not implement it.
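
To illustrate the offset concern, here is a toy sketch in Python. It is not the Parquet file layout and not the parquet-mr API: the "file" is just pages laid out back to back, the helper names {{build_offsets}} and {{transcompress}} are hypothetical, and zlib and lzma stand in for the source and target codecs. It only shows that an index built before re-compression no longer matches the page boundaries afterwards, so it has to be rebuilt.

```python
import lzma
import zlib


def build_offsets(pages, start=0):
    """Return one (offset, compressed_size) entry per page, pages laid out back to back."""
    index, pos = [], start
    for page in pages:
        index.append((pos, len(page)))
        pos += len(page)
    return index


def transcompress(pages):
    """Re-compress each page (zlib -> lzma) without interpreting the page contents."""
    return [lzma.compress(zlib.decompress(page)) for page in pages]


# Three toy "pages" of already-encoded bytes (assumed data, for illustration only).
raw_pages = [bytes([i]) * 4096 + b"tail-%d" % i for i in range(3)]

old_pages = [zlib.compress(page) for page in raw_pages]
old_index = build_offsets(old_pages)

new_pages = transcompress(old_pages)
new_index = build_offsets(new_pages)  # the index must be rebuilt from the new page sizes
new_file = b"".join(new_pages)

# Reading through the rebuilt index recovers every page.
for (offset, size), raw in zip(new_index, raw_pages):
    assert lzma.decompress(new_file[offset:offset + size]) == raw
```

Reusing {{old_index}} against {{new_file}} would slice at the wrong byte positions, which is the reason the index has to be rewritten together with the pages.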
> Add TransCompression command
> -----------------------------
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to ZSTD compression, which can achieve a higher compression ratio. It would be useful to have a tool that converts a Parquet file directly by just decompressing and recompressing each page, without decoding/encoding the values or assembling records, because that is much faster. The initial result shows it is ~5 times faster.
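
The page-level idea in the description can be sketched with a toy model in Python. Everything here is an assumption made for illustration: JSON stands in for the page encoding, zlib and lzma stand in for the old and new codecs, and the function names are hypothetical. None of it is parquet-mr code. The sketch contrasts a record-level rewrite, which decodes and re-encodes every page, with a page-level rewrite, which only swaps the compression and never touches the encoded bytes.

```python
import json
import lzma
import zlib


def encode_page(records):
    """Stand-in for the page encoding step."""
    return json.dumps(records, separators=(",", ":")).encode("utf-8")


def decode_page(payload):
    """Stand-in for the page decoding step."""
    return json.loads(payload.decode("utf-8"))


def rewrite_via_records(pages):
    """Slow path: decompress, decode to records, re-encode, compress."""
    return [
        lzma.compress(encode_page(decode_page(zlib.decompress(page))))
        for page in pages
    ]


def rewrite_via_pages(pages):
    """Fast path: decompress and compress only; the encoded bytes pass through as-is."""
    return [lzma.compress(zlib.decompress(page)) for page in pages]


# Two toy pages of 100 records each (assumed data, for illustration only).
records = [
    [{"id": i, "name": "row-%d" % i} for i in range(start, start + 100)]
    for start in (0, 100)
]
source = [zlib.compress(encode_page(page)) for page in records]

fast = rewrite_via_pages(source)
slow = rewrite_via_records(source)

# Both paths yield pages that decode to the same records.
assert [decode_page(lzma.decompress(p)) for p in fast] == records
assert [decode_page(lzma.decompress(p)) for p in slow] == records
```

The fast path skips {{decode_page}} and {{encode_page}} entirely, which is where the time saving described above would come from. The toy model makes no claim about the size of that saving.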
--
This message was sent by Atlassian Jira
(v8.3.4#803005)