Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/04/17 13:19:00 UTC
[jira] [Commented] (DRILL-6331) Parquet filter pushdown does not support the native hive reader

    [ https://issues.apache.org/jira/browse/DRILL-6331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440838#comment-16440838 ] 

ASF GitHub Bot commented on DRILL-6331:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/1214

    DRILL-6331: Revisit Hive Drill native parquet implementation to be exposed to Drill optimizations (filter / limit push down, count to direct scan)
    
    1. Factored out common logic for the Drill parquet reader and the Hive Drill native parquet reader into AbstractParquetGroupScan, AbstractParquetRowGroupScan, and AbstractParquetScanBatchCreator.
    2. Rules that previously worked only with ParquetGroupScan can now be applied to any class that extends AbstractParquetGroupScan: DrillFilterItemStarReWriterRule, ParquetPruneScanRule, PruneScanRule.
    3. Hive populates partition values based on information returned from the Hive metastore. Drill populates partition values based on the path difference between the selection root and the actual file path.
       Previously, ColumnExplorer populated partition values using the Drill approach. Since ColumnExplorer now populates values for parquet files from Hive tables,
       the `populateImplicitColumns` method was changed to populate partition columns based only on the given partition values.
    4. Refactored ParquetPartitionDescriptor to be responsible for populating partition values, rather than keeping this logic in the parquet group scan class.
    5. Moved the Metadata class to a separate metadata package (org.apache.drill.exec.store.parquet.metadata). Factored out several inner classes to improve code readability.
    6. Collected all Drill native parquet reader unit tests into one class, TestHiveDrillNativeParquetReader, and added new tests to cover the new functionality.
    7. Reduced excessive logging when parquet file metadata is read.
    8. Added a Drill stopwatch implementation (a wrapper around the Guava stopwatch, plus DummyStopwatch). This helps save system resources when the debug level is not enabled.
    
    Link to Jira - [DRILL-6331](https://issues.apache.org/jira/browse/DRILL-6331).
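
The two partition-value approaches contrasted in item 3 can be illustrated with a minimal sketch. It is not taken from the pull request: `PathPartitionSketch`, `partitionValues`, and `toColumns` are hypothetical names, and the `dir0`, `dir1`, ... column labels are assumed here for illustration. The first method derives values from the path difference between the selection root and the file path; the second only maps values it is given, which is roughly the shape a caller would need when the values come from the Hive metastore instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not Drill's ColumnExplorer API.
final class PathPartitionSketch {
  private PathPartitionSketch() {}

  // Path-difference approach: directory names between the selection root and the file,
  // e.g. root "/data/orders", file "/data/orders/2018/04/part-0.parquet" -> [2018, 04].
  static List<String> partitionValues(String selectionRoot, String filePath) {
    String root = selectionRoot.endsWith("/") ? selectionRoot : selectionRoot + "/";
    if (!filePath.startsWith(root)) {
      throw new IllegalArgumentException(filePath + " is not under " + selectionRoot);
    }
    String[] parts = filePath.substring(root.length()).split("/");
    // The last element is the file name; everything before it is a directory level.
    return new ArrayList<>(Arrays.asList(parts).subList(0, parts.length - 1));
  }

  // Given-values approach: the caller supplies the values (from a path or from a
  // metastore) and this method only assigns them to columns, in order.
  // The "dir" + index labels are an assumption made for this sketch.
  static Map<String, String> toColumns(List<String> values) {
    Map<String, String> columns = new LinkedHashMap<>();
    for (int i = 0; i < values.size(); i++) {
      columns.put("dir" + i, values.get(i));
    }
    return columns;
  }
}
```

Keeping the derivation step separate from the column-assignment step means the assignment code does not need to know where the values came from.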

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-6331

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1214
    
----
commit af5dff61b6b70c4ef70d4a5173aa63f5faa9c2c0
Author: Arina Ielchiieva <ar...@...>
Date:   2018-03-20T18:29:45Z

    DRILL-6331: Revisit Hive Drill native parquet implementation to be exposed to Drill optimizations (filter / limit push down, count to direct scan)
    
    1. Factored out common logic for the Drill parquet reader and the Hive Drill native parquet reader into AbstractParquetGroupScan, AbstractParquetRowGroupScan, and AbstractParquetScanBatchCreator.
    2. Rules that previously worked only with ParquetGroupScan can now be applied to any class that extends AbstractParquetGroupScan: DrillFilterItemStarReWriterRule, ParquetPruneScanRule, PruneScanRule.
    3. Hive populates partition values based on information returned from the Hive metastore. Drill populates partition values based on the path difference between the selection root and the actual file path.
       Previously, ColumnExplorer populated partition values using the Drill approach. Since ColumnExplorer now populates values for parquet files from Hive tables,
       the `populateImplicitColumns` method was changed to populate partition columns based only on the given partition values.
    4. Refactored ParquetPartitionDescriptor to be responsible for populating partition values, rather than keeping this logic in the parquet group scan class.
    5. Moved the Metadata class to a separate metadata package (org.apache.drill.exec.store.parquet.metadata). Factored out several inner classes to improve code readability.
    6. Collected all Drill native parquet reader unit tests into one class, TestHiveDrillNativeParquetReader, and added new tests to cover the new functionality.
    7. Reduced excessive logging when parquet file metadata is read.
    8. Added a Drill stopwatch implementation (a wrapper around the Guava stopwatch, plus DummyStopwatch). This helps save system resources when the debug level is not enabled.
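
The stopwatch idea in item 8 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the commit: `SketchStopwatch`, `RealStopwatch`, `NoOpStopwatch`, and the `create(boolean)` factory are hypothetical names, and `System.nanoTime` stands in for the Guava stopwatch so the example needs no external dependency.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical names throughout; this is not Drill's actual stopwatch API.
interface SketchStopwatch {
  SketchStopwatch start();
  SketchStopwatch stop();
  long elapsed(TimeUnit unit);

  // Hands out a real timer only when the timing output would actually be logged.
  static SketchStopwatch create(boolean debugEnabled) {
    return debugEnabled ? new RealStopwatch() : NoOpStopwatch.INSTANCE;
  }
}

// Measures elapsed time with System.nanoTime (a stand-in for the Guava stopwatch).
final class RealStopwatch implements SketchStopwatch {
  private long startNanos;
  private long elapsedNanos;
  private boolean running;

  @Override
  public SketchStopwatch start() {
    if (!running) {
      startNanos = System.nanoTime();
      running = true;
    }
    return this;
  }

  @Override
  public SketchStopwatch stop() {
    if (running) {
      elapsedNanos += System.nanoTime() - startNanos;
      running = false;
    }
    return this;
  }

  @Override
  public long elapsed(TimeUnit unit) {
    long total = running ? elapsedNanos + (System.nanoTime() - startNanos) : elapsedNanos;
    return unit.convert(total, TimeUnit.NANOSECONDS);
  }
}

// Shared stateless instance: no clock reads and no allocation per use.
final class NoOpStopwatch implements SketchStopwatch {
  static final NoOpStopwatch INSTANCE = new NoOpStopwatch();

  private NoOpStopwatch() {}

  @Override public SketchStopwatch start() { return this; }
  @Override public SketchStopwatch stop() { return this; }
  @Override public long elapsed(TimeUnit unit) { return 0L; }
}
```

A caller would pass something like `logger.isDebugEnabled()` to `create`, so code paths that time metadata reads pay for clock calls only when the result is going to be logged.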

----


> Parquet filter pushdown does not support the native hive reader
> ---------------------------------------------------------------
>
>                 Key: DRILL-6331
>                 URL: https://issues.apache.org/jira/browse/DRILL-6331
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Storage - Hive
>    Affects Versions: 1.13.0
>            Reporter: Arina Ielchiieva
>            Assignee: Arina Ielchiieva
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Initially, HiveDrillNativeParquetGroupScan was based mainly on HiveScan; the core difference between them was
> that HiveDrillNativeParquetScanBatchCreator created ParquetRecordReader instead of HiveReader.
> This allowed Drill to read Hive parquet files using the Drill native parquet reader but did not expose Hive data to Drill optimizations
> such as filter push down, limit push down, and count to direct scan.
> The Hive code had to be refactored to use the same interfaces as ParquetGroupScan in order to be exposed to such optimizations.
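
The reasoning in the description above, that an optimization written against a shared base type applies to every reader that extends it, can be sketched as follows. All class names are hypothetical stand-ins rather than Drill's real classes or signatures, and the "limit push down" here is reduced to capping a row count purely for illustration.

```java
import java.util.Optional;

// Hypothetical shared base type; stands in for an abstract parquet group scan.
abstract class AbstractScanSketch {
  private final int rowCount;

  AbstractScanSketch(int rowCount) {
    this.rowCount = rowCount;
  }

  int rowCount() {
    return rowCount;
  }

  // Each reader flavor knows how to copy itself with a smaller row budget.
  abstract AbstractScanSketch withRowCount(int rowCount);
}

// Stand-in for the regular native parquet scan.
final class NativeScanSketch extends AbstractScanSketch {
  NativeScanSketch(int rowCount) { super(rowCount); }

  @Override
  AbstractScanSketch withRowCount(int rowCount) { return new NativeScanSketch(rowCount); }
}

// Stand-in for the Hive native parquet scan.
final class HiveNativeScanSketch extends AbstractScanSketch {
  HiveNativeScanSketch(int rowCount) { super(rowCount); }

  @Override
  AbstractScanSketch withRowCount(int rowCount) { return new HiveNativeScanSketch(rowCount); }
}

// Simplified "limit push down": matches on the base type, so both subclasses qualify.
final class LimitPushDownSketch {
  private LimitPushDownSketch() {}

  static Optional<AbstractScanSketch> apply(Object scan, int limit) {
    if (!(scan instanceof AbstractScanSketch)) {
      return Optional.empty(); // not a scan this rule understands
    }
    AbstractScanSketch s = (AbstractScanSketch) scan;
    // Rewrite only when the limit actually reduces the amount of data to read.
    return limit < s.rowCount() ? Optional.of(s.withRowCount(limit)) : Optional.empty();
  }
}
```

A rule that instead checked for one concrete class would silently skip the Hive variant, which is the situation the issue describes.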



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)