Posted to issues@orc.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/03/29 02:54:00 UTC

[jira] [Updated] (ORC-1143) [C++] Support reading the PRESENT stream without reading the column data

     [ https://issues.apache.org/jira/browse/ORC-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang updated ORC-1143:
--------------------------------
    Description: 
Queries like "select count(a) from tbl" only need to check whether the column value is not NULL. ORC files already have a PRESENT stream for each column (though it's optional). We can serve such requests by reading just the PRESENT stream.

Currently, ReadIntent has two items:
{code:java}
enum ReadIntent {
  ReadIntent_ALL = 0,

  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1
};{code}
We can extend it to add an item like ReadIntent_PRESENT. The corresponding ColumnVectorBatch will only have valid notNull results.
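
A minimal sketch of what the extended enum might look like. ReadIntent_PRESENT and its value 2 are only the proposal from this ticket, not existing API, and the helper onlyNotNullIsValid is a hypothetical name used purely to spell out the intended semantics:

```cpp
#include <cassert>

// Existing items plus the proposed one (the value 2 is an assumption).
enum ReadIntent {
  ReadIntent_ALL = 0,

  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1,

  // Proposed: only read the PRESENT stream of the selected type.
  ReadIntent_PRESENT = 2
};

// Hypothetical helper: under the proposal, a batch produced with
// ReadIntent_PRESENT would only carry valid notNull results.
inline bool onlyNotNullIsValid(ReadIntent intent) {
  assert(intent >= ReadIntent_ALL && intent <= ReadIntent_PRESENT);
  return intent == ReadIntent_PRESENT;
}
```

A caller that only needs NULL information could then request ReadIntent_PRESENT for a column and ignore everything in the batch except notNull.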

This would help more on string columns, e.g. when checking how many customers have an email address:
{code:sql}
select count(email_address) from tpcds.customer {code}
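
As a rough illustration of how such a count could be served from notNull alone, here is a sketch that does not depend on the ORC headers. PresentOnlyBatch is a hypothetical stand-in that mirrors only the numElements, hasNulls and notNull fields of ColumnVectorBatch, and countNonNull / countNonNullAll are illustrative names, not part of the library:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the null-tracking fields of a column batch.
struct PresentOnlyBatch {
  uint64_t numElements;       // number of rows in the batch
  bool hasNulls;              // false when no row in the batch is NULL
  std::vector<char> notNull;  // 1 = present, 0 = NULL; only used if hasNulls
};

// COUNT(col) over one batch: the number of rows whose value is not NULL.
uint64_t countNonNull(const PresentOnlyBatch& batch) {
  if (!batch.hasNulls) {
    // No NULLs (e.g. the optional PRESENT stream is absent): every row counts.
    return batch.numElements;
  }
  assert(batch.notNull.size() >= batch.numElements);
  uint64_t count = 0;
  for (uint64_t i = 0; i < batch.numElements; ++i) {
    if (batch.notNull[i]) {
      ++count;
    }
  }
  return count;
}

// COUNT(col) over a sequence of batches.
uint64_t countNonNullAll(const std::vector<PresentOnlyBatch>& batches) {
  uint64_t total = 0;
  for (const PresentOnlyBatch& batch : batches) {
    total += countNonNull(batch);
  }
  return total;
}
```

The string data itself is never touched here, which is why the saving would be largest for wide columns such as email_address.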

  was:
Queries like "select count(a) from tbl" only need to check whether the column value is not NULL. ORC files already have a PRESENT stream for each column (though it's optional). We can serve such requests by reading just the PRESENT stream.

Currently, ReadIntent has two items:
{code:java}
enum ReadIntent {
  ReadIntent_ALL = 0,

  // Only read the offsets of selected type. Do not read the children types.
  ReadIntent_OFFSETS = 1
};{code}
We can extend it to add an item like ReadIntent_PRESENT. The corresponding ColumnVectorBatch will only have valid notNull results.


> [C++] Support reading the PRESENT stream without reading the column data
> ------------------------------------------------------------------------
>
>                 Key: ORC-1143
>                 URL: https://issues.apache.org/jira/browse/ORC-1143
>             Project: ORC
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Quanlong Huang
>            Priority: Major
>
> Queries like "select count(a) from tbl" only need to check whether the column value is not NULL. ORC files already have a PRESENT stream for each column (though it's optional). We can serve such requests by reading just the PRESENT stream.
> Currently, ReadIntent has two items:
> {code:java}
> enum ReadIntent {
>   ReadIntent_ALL = 0,
>   // Only read the offsets of selected type. Do not read the children types.
>   ReadIntent_OFFSETS = 1
> };{code}
> We can extend it to add an item like ReadIntent_PRESENT. The corresponding ColumnVectorBatch will only have valid notNull results.
> This would help more on string columns, e.g. when checking how many customers have an email address:
> {code:sql}
> select count(email_address) from tpcds.customer {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)