Posted to issues@orc.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2017/10/25 03:06:00 UTC

[jira] [Comment Edited] (ORC-210) Add new ORC 2.0 encoding for Double, Float.

    [ https://issues.apache.org/jira/browse/ORC-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218025#comment-16218025 ] 

Owen O'Malley edited comment on ORC-210 at 10/25/17 3:05 AM:
-------------------------------------------------------------

OK, a couple of points are clear:
* We must have better datasets.
** TPC-DS is decimal data, not floating point. It is also synthetic rather than real.
** We need datasets of ~10 million values.
** I'd propose the NYC taxi drop-off longitude and latitude data.
** We need some non-repetitive datasets (cardinality / count > 99%).
* The interesting metrics are:
** Write speed
** Read speed
** Compression
* The multiply-by-100 trick in your modified FPC method is too tied to the particular datasets and isn't generally useful.
* We need some high-cardinality datasets because the current ones would be best encoded using a dictionary.
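
To make the dataset criteria above concrete, here is a minimal Python sketch, not code from the ORC-210 patches. The function names (cardinality_ratio, is_non_repetitive, scale_round_trips) are hypothetical. The last helper assumes the "multiply by 100" trick means scaling each value by 100 and storing the rounded integer; the patch may do something different.

```python
import math
import struct


def cardinality_ratio(values):
    """Return distinct values / total values for a list of doubles.

    Values are compared by IEEE 754 bit pattern, which is how a
    byte-level encoder sees them (so 0.0 and -0.0 count as distinct).
    """
    if not values:
        raise ValueError("values must be non-empty")
    distinct = {struct.pack("<d", v) for v in values}
    return len(distinct) / len(values)


def is_non_repetitive(values, threshold=0.99):
    """True when cardinality / count exceeds the threshold (99% by default)."""
    return cardinality_ratio(values) > threshold


def scale_round_trips(values, scale=100):
    """True when every value survives scale -> round -> unscale exactly.

    Assumption: this models the "multiply by 100" trick as integer scaling.
    It only holds for data with at most two decimal digits.
    """
    return all(
        math.isfinite(v) and round(v * scale) / scale == v for v in values
    )


prices = [9.99, 19.99, 9.99, 4.50]       # decimal-like, repetitive
latitudes = [40.712776, 40.758896]       # more digits than the scale covers
print(cardinality_ratio(prices))
print(scale_round_trips(prices), scale_round_trips(latitudes))
```

A dataset like prices is dictionary-friendly and survives the scaling trick, while coordinate-style data does not, which is why the benchmark needs both kinds of input.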


> Add new ORC 2.0 encoding for Double, Float.
> -------------------------------------------
>
>                 Key: ORC-210
>                 URL: https://issues.apache.org/jira/browse/ORC-210
>             Project: ORC
>          Issue Type: Improvement
>          Components: encoding, Java
>    Affects Versions: 2.0.0
>            Reporter: Dapeng Sun
>            Assignee: Teddy Choi
>         Attachments: ORC-210.1.patch, ORC-210.2.patch, patch.txt
>
>
> Currently, Double and Float use PLAIN encoding. It would be better to support encodings such as Dictionary or BitPacking to reduce storage cost.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)