Posted to dev@parquet.apache.org by "Brock Noland (JIRA)" <ji...@apache.org> on 2014/11/11 17:09:33 UTC

[jira] [Commented] (PARQUET-131) Supporting Vectorized APIs in Parquet

    [ https://issues.apache.org/jira/browse/PARQUET-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14206579#comment-14206579 ] 

Brock Noland commented on PARQUET-131:
--------------------------------------

Hi,

Thank you very much for creating this! I sincerely appreciate you taking the time to create this proposal!

From the Hive side, I have the following feedback:

My understanding is that {{ColumnVector}} is an interface so we can provide our own impl. This will be required for Hive since we have our own {{ColumnVector}} impl and it's extremely widely used. I don't think this version of the {{ColumnVector}} interface will provide pluggability for the following reasons:

# Impls, e.g. {{LongVector}}, have public members. The same thing was done in Hive (not using getters and setters), but IMO for dubious reasons. No proof was provided showing that the JIT does not optimize the getters and setters out.
# Drill, Hive, etc. will be required to extend {{LongVector}} in order to make this work, but that would require a massive change on the Hive side. We should provide getters and setters on the interface for the data types so that Hive can simply implement the {{ColumnVector}} interface with our existing implementation. We might also need to provide {{isLongVector}} methods so we know the type of the {{ColumnVector}}.
# I don't understand why {{ColumnVector}} has a {{getEncoding}} method. Isn't an encoding a storage feature, not a column vector feature?
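
To make the second point concrete, here is a rough sketch of the shape I have in mind. It is only an illustration, not the API from the proposal: the method names {{size}}, {{getLong}} and {{setLong}}, and the classes {{EngineLongColumnVector}} and {{VectorFiller}}, are hypothetical. {{EngineLongColumnVector}} stands in for an engine's existing vector class with a public array; it is not Hive's actual code.

```java
// Hypothetical sketch only: apart from the names ColumnVector and isLongVector,
// nothing here is taken from the PARQUET-131 proposal or from Hive.

// The interface exposes typed accessors instead of public fields, so an
// engine can back it with whatever storage it already has.
interface ColumnVector {
    int size();
    boolean isLongVector();              // lets callers discover the vector's type
    long getLong(int index);
    void setLong(int index, long value);
}

// Stand-in for an engine's pre-existing vector class. It keeps its public
// array (so existing engine code is untouched) and simply implements the
// interface on top of it.
class EngineLongColumnVector implements ColumnVector {
    public final long[] vector;

    EngineLongColumnVector(int capacity) {
        this.vector = new long[capacity];
    }

    @Override public int size() { return vector.length; }
    @Override public boolean isLongVector() { return true; }
    @Override public long getLong(int index) { return vector[index]; }
    @Override public void setLong(int index, long value) { vector[index] = value; }
}

// Stand-in for reader-side code: it only sees the interface, never the
// concrete class, so it works with any engine's implementation.
final class VectorFiller {
    private VectorFiller() {}

    static void fillSequence(ColumnVector v, long start) {
        if (!v.isLongVector()) {
            throw new IllegalArgumentException("expected a long vector");
        }
        for (int i = 0; i < v.size(); i++) {
            v.setLong(i, start + i);
        }
    }
}
```

With this shape, the reader writes through {{setLong}} and the engine keeps reading its own public array directly, so neither side has to extend the other's concrete class. Whether the accessor calls are inlined by the JIT would still need to be measured.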

> Supporting Vectorized APIs in Parquet
> -------------------------------------
>
>                 Key: PARQUET-131
>                 URL: https://issues.apache.org/jira/browse/PARQUET-131
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Zhenxiao Luo
>            Assignee: Zhenxiao Luo
>
> Vectorized Query Execution could have big performance improvement for SQL engines like Hive, Drill, and Presto. Instead of processing one row at a time, Vectorized Query Execution could streamline operations by processing a batch of rows at a time. Within one batch, each column is represented as a vector of a primitive data type. SQL engines could apply predicates very efficiently on these vectors, avoiding a single row going through all the operators before the next row can be processed.
> As an efficient columnar data representation, it would be nice if Parquet could support Vectorized APIs, so that all SQL engines could read vectors from Parquet files, and do vectorized execution for Parquet File Format.
>  
> Detail proposal:
> https://gist.github.com/zhenxiao/2728ce4fe0a7be2d3b30



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)