Posted to dev@parquet.apache.org by "Ryan Blue (JIRA)" <ji...@apache.org> on 2018/02/26 19:07:00 UTC

[jira] [Comment Edited] (PARQUET-796) Delta Encoding is not used when dictionary enabled

    [ https://issues.apache.org/jira/browse/PARQUET-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377382#comment-16377382 ] 

Ryan Blue edited comment on PARQUET-796 at 2/26/18 7:06 PM:
------------------------------------------------------------

I don't recommend using the delta long encoding because I think we need to update to better encodings (specifically, the zig-zag-encoding ones in [this branch|https://github.com/rdblue/parquet-mr/commits/encoders]).
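
For readers unfamiliar with the term: zig-zag encoding maps signed integers to non-negative ones so that values of small magnitude, whether positive or negative, get small codes. The sketch below illustrates only that general mapping. The class and method names are hypothetical, and this is not code from the linked branch.

```java
// Hypothetical illustration of the general zig-zag mapping;
// not taken from parquet-mr or from the linked branch.
final class ZigZagSketch {

    // Interleave negatives and positives: 0, -1, 1, -2 map to 0, 1, 2, 3.
    static long encode(long n) {
        // Arithmetic shift (>>) spreads the sign bit across all 64 bits.
        return (n << 1) ^ (n >> 63);
    }

    // Inverse of encode.
    static long decode(long z) {
        // Logical shift (>>>) drops the low bit; -(z & 1) is 0 or all ones.
        return (z >>> 1) ^ -(z & 1);
    }

    public static void main(String[] args) {
        long[] deltas = {0, -1, 1, -2, 1000, -1000};
        for (long d : deltas) {
            long z = encode(d);
            System.out.println(d + " -> " + z + " -> " + decode(z));
        }
    }
}
```

Small deltas of either sign then need few significant bits, which is what makes the mapping attractive in front of a bit-packing or variable-length integer step.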

We could definitely use a better fallback, but I don't think the solution is to turn off dictionary encoding. If you can use dictionary encoding to get a smaller size, you should. The problem arises when the dictionary writer checks whether another encoding would be better: it currently compares only against plain encoding and falls back to plain. We should have it compare against a delta encoding and fall back to that instead.
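
A minimal sketch of the decision described above, assuming the writer can obtain a byte-size estimate for the dictionary-encoded data and for a candidate fallback. The class, the enum, and the size inputs are all hypothetical and are not the parquet-mr API.

```java
// Hypothetical sketch of a size-based fallback decision;
// names and inputs are illustrative, not parquet-mr classes.
final class FallbackChoiceSketch {

    enum Encoding { DICTIONARY, PLAIN, DELTA }

    // Keep dictionary encoding only if it is smaller than the fallback estimate;
    // otherwise switch to whichever fallback encoding was compared against.
    static Encoding choose(long dictionaryBytes, long fallbackBytes, Encoding fallback) {
        return dictionaryBytes < fallbackBytes ? Encoding.DICTIONARY : fallback;
    }

    public static void main(String[] args) {
        long dictionaryBytes = 900; // assumed estimate: dictionary page plus encoded ids
        long plainBytes = 8_000;    // assumed estimate: 1000 longs at 8 bytes each
        long deltaBytes = 300;      // assumed estimate for a delta encoding

        // Comparing against plain keeps the dictionary here...
        System.out.println(choose(dictionaryBytes, plainBytes, Encoding.PLAIN));
        // ...while comparing against delta shows a smaller alternative exists.
        System.out.println(choose(dictionaryBytes, deltaBytes, Encoding.DELTA));
    }
}
```

With these made-up numbers, a check against plain never notices that the delta encoding would be smaller than the dictionary, which is the gap the comment points at.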

This kind of improvement is why we added PARQUET-601. We want to be able to test out different ways of choosing an encoding at write time. But we do not want to require users to specify their own encodings; we want Parquet to select them automatically and get the choice right. PARQUET-601 is about testing out strategies that we can then release as the defaults.


> Delta Encoding is not used when dictionary enabled
> --------------------------------------------------
>
>                 Key: PARQUET-796
>                 URL: https://issues.apache.org/jira/browse/PARQUET-796
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.9.0
>            Reporter: Jakub Liska
>            Priority: Critical
>             Fix For: 1.9.1
>
>
> The current code doesn't allow Delta Encoding and Dictionary Encoding to be used together. If I instantiate ParquetWriter like this:
> {code}
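> // Scala cannot pass named arguments to a Java constructor, so the flags are positional.
> val writer = new ParquetWriter[Group](
>   outFile, new GroupWriteSupport, codec, blockSize, pageSize, dictPageSize,
>   true, // enableDictionary
>   true, // validating
>   ParquetProperties.WriterVersion.PARQUET_2_0, configuration)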
> {code}
> then this piece of code:
> https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultValuesWriterFactory.java#L78-L86
> causes DictionaryValuesWriter to be used instead of the inferred DeltaLongEncodingWriter.
> The original issue is here: https://github.com/apache/parquet-mr/pull/154#issuecomment-266489768


