Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/06/12 07:56:00 UTC

[jira] [Commented] (PARQUET-1872) Add TransCompression command

    [ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134030#comment-17134030 ] 

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~shangx@uber.com], I don't know why the PR was not linked here automatically. Please add it manually so that we have the reference.
I don't understand the sub-tasks. The header of PR #796 references this jira, yet the PR already resolves the parquet-tools related sub-task. I don't think adding this functionality to {{parquet-cli}} is big enough to be split out into another task. (I would suggest implementing the functionality in one place and invoking it from both {{parquet-tools}} and {{parquet-cli}}.)
What is the bloom filter support about? I am not sure about the bloom filters, but the offset indexes surely have to be updated, as the page offsets will change. Without that the feature is incorrect, so I would not merge a PR to master without implementing it.
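
To illustrate why the offset indexes have to change, here is a minimal sketch in plain Java. It is not parquet-mr code: {{PageLocation}} and {{OffsetIndexSketch}} are hypothetical stand-ins whose fields mirror the offset, compressed page size and first row index stored per page in the Parquet offset index. It also assumes that the pages of a column chunk are written back to back, and the page sizes are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one offset index entry; not a parquet-mr class.
final class PageLocation {
    final long offset;         // byte position of the page in the file
    final int compressedSize;  // size of the page on disk, header included
    final long firstRowIndex;  // index of the first row stored in the page

    PageLocation(long offset, int compressedSize, long firstRowIndex) {
        this.offset = offset;
        this.compressedSize = compressedSize;
        this.firstRowIndex = firstRowIndex;
    }
}

final class OffsetIndexSketch {
    // Rebuilds the page locations after every page has been recompressed.
    // Row indexes are untouched because no record is decoded or reassembled;
    // only the sizes, and therefore the offsets, change.
    static List<PageLocation> rebuild(List<PageLocation> oldLocations,
                                      int[] newCompressedSizes,
                                      long newChunkStart) {
        if (oldLocations.size() != newCompressedSizes.length) {
            throw new IllegalArgumentException("one new size is needed per page");
        }
        List<PageLocation> rebuilt = new ArrayList<>(oldLocations.size());
        long offset = newChunkStart;
        for (int i = 0; i < oldLocations.size(); i++) {
            rebuilt.add(new PageLocation(offset, newCompressedSizes[i],
                                         oldLocations.get(i).firstRowIndex));
            offset += newCompressedSizes[i]; // assumption: pages are contiguous
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        // Made-up locations of three pages before recompression.
        List<PageLocation> before = Arrays.asList(
                new PageLocation(4, 1000, 0),
                new PageLocation(1004, 1200, 500),
                new PageLocation(2204, 900, 1100));
        // Made-up sizes of the same pages after recompression.
        int[] recompressedSizes = {700, 850, 640};

        for (PageLocation page : rebuild(before, recompressedSizes, 4)) {
            System.out.println(page.offset + " " + page.compressedSize
                    + " " + page.firstRowIndex);
        }
    }
}
```

Every page after the first one lands at a different offset than before, so an offset index copied unchanged from the source file would point readers at the wrong bytes.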

> Add TransCompression command 
> -----------------------------
>
>                 Key: PARQUET-1872
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1872
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to ZSTD compression, which can achieve a higher compression ratio. It would be useful to have a tool that converts a Parquet file directly by just decompressing/compressing each page, without decoding/encoding or assembling the records, because that is much faster. The initial result shows it is ~5 times faster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)