Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/03/11 19:40:39 UTC

[jira] [Commented] (DRILL-1328) Support table statistics

    [ https://issues.apache.org/jira/browse/DRILL-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191353#comment-15191353 ] 

ASF GitHub Bot commented on DRILL-1328:
---------------------------------------

GitHub user vkorukanti opened a pull request:

    https://github.com/apache/drill/pull/425

    DRILL-1328: Support table statistics

    The patch attached to the JIRA seems to be useful for generating table stats and using them for query planning. I rebased the patch onto the latest master, fixed a few issues, and added a few tests.
    
    It still needs work to make it a full-fledged feature, but I think the current state of the patch is good enough to commit now, with improvements and fixes to follow later.
    
    @jinfengni and @amansinha100 : Could you please review the patch?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vkorukanti/drill DRILL-1328-r1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/425.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #425
    
----
commit 079d109ee40d1be0ce6f5cfed5a091da41a477dc
Author: Cliff Buchanan <cb...@maprtech.com>
Date:   2014-08-21T21:59:53Z

    DRILL-1328: Support table statistics
    
    PRE: Add "append" concept to directory write.
    
    * This is so stats can be stored in [table].stats.drill and be appended to by writing a new file into the directory.
    
    FUNCS: Statistics functions as UDFs:
    Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A.
    * custom versions of "count" that always return BigInt
    * HyperLogLog based NDV that returns BigInt that works only on VarChars
    * HyperLogLog with binary output that only works on VarChars
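
For readers unfamiliar with the technique behind the NDV functions listed above, here is a minimal HyperLogLog sketch. It is written in Python rather than as a Drill UDF, and the class name, the register count (p=10), and the use of SHA-1 as the hash are all assumptions made for illustration; none of them are taken from the patch.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog estimator for the number of distinct values (NDV)."""

    def __init__(self, p=10):
        self.p = p                     # number of bits used to pick a register
        self.m = 1 << p                # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Hash the string to 64 bits (SHA-1 is an arbitrary choice for this sketch).
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Two sketches over different data combine by a register-wise max.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)
```

Because sketches merge by a register-wise max, a serialized ("binary output") sketch can be stored and later combined with sketches from other files, which is presumably why a binary variant is useful alongside the BigInt one.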
    
    OPS: Updated protobufs for new ops
    
    OPS: Implemented StatisticsAggregate
    
    OPS: Implemented StatisticsUnpivot
    
    ANALYZE: AnalyzeTable functionality
    * JavaCC syntax more-or-less copied from LucidDB.
    * (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel and StatsAggPrel
    
    ANALYZE: Add getMetadataTable() to AbstractSchema
    
    USAGE: Change field access in QueryWrapper
    
    USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel
    * since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor
    * This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans.
    
    USAGE: Attach DrillTableMetadata to DrillTable.
    * DrillTableMetadata represents the data scanned from a corresponding ".stats.drill" table
    * In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used.
    ** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated.
    ** Query is set up to extract only the most recent statistics results for each column.
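
The "most recent statistics results for each column" step can be pictured as follows. This is only a sketch in Python of the selection logic, not the SQL that the patch generates; the row layout and the field names ("column", "computed", "ndv") are hypothetical.

```python
def latest_stats(rows):
    """Keep, for each column, only the row with the newest 'computed' timestamp."""
    latest = {}
    for row in rows:
        col = row["column"]
        # Replace the stored row only if this one was computed more recently.
        if col not in latest or row["computed"] > latest[col]["computed"]:
            latest[col] = row
    return latest


# Hypothetical contents of a ".stats.drill" table after two ANALYZE runs.
rows = [
    {"column": "id", "computed": 1, "ndv": 90},
    {"column": "name", "computed": 1, "ndv": 40},
    {"column": "id", "computed": 2, "ndv": 100},
]
stats = latest_stats(rows)
```

Since statistics are appended as new files rather than overwritten, some selection like this is needed so that older runs do not shadow newer ones.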
    
    USAGE: Configure DrillJoinRelBase to use NDV metadata when available.
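
As an illustration of how NDV metadata can feed a join estimate, the textbook equi-join formula divides the product of the input row counts by the larger of the two join-key NDVs. The sketch below shows that formula in Python; it is an assumption about the general approach, not a transcription of what DrillJoinRelBase does in the patch, and the function name is hypothetical.

```python
def estimate_join_rows(left_rows, right_rows, left_ndv, right_ndv):
    """Textbook equi-join cardinality estimate: |L| * |R| / max(ndv_L, ndv_R)."""
    if not left_ndv or not right_ndv:
        # No usable NDV metadata: signal the caller to fall back to its default.
        return None
    return left_rows * right_rows / max(left_ndv, right_ndv)
```

For example, joining 1000 rows (100 distinct keys) with 500 rows (50 distinct keys) gives 1000 * 500 / 100 = 5000 estimated rows.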
    
    USAGE: attach metadata to table
    
    USAGE: implement optiq provider

----


> Support table statistics
> ------------------------
>
>                 Key: DRILL-1328
>                 URL: https://issues.apache.org/jira/browse/DRILL-1328
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Cliff Buchanan
>             Fix For: Future
>
>         Attachments: 0001-PRE-Set-value-count-in-splitAndTransfer.patch
>
>
> This consists of several subtasks
> * implement operators to generate statistics
> * add "analyze table" support to parser/planner
> * create a metadata provider to allow statistics to be used by optiq in planning optimization
> * implement statistics functions
> Right now, the bulk of this functionality is implemented, but it hasn't been rigorously tested, and some of the parts "around the edges" still need definite answers (how analyze table figures out where the table statistics are located, and how a table "append" should work in a read-only file system).
> Also, here are a few known caveats:
> * table statistics are collected by creating a SQL query based on the string path of the table. This should probably be done with a Table reference.
> * Case sensitivity for column statistics is probably iffy
> * Math for combining two column NDVs into a joint NDV should be checked.
> * Schema changes aren't really being considered yet.
> * adding getDrillTable is probably unnecessary; it might be better to do getTable().unwrap(DrillTable.class)
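
On the caveat about combining two column NDVs into a joint NDV: one common approach, shown here purely as an assumption and not as the formula the patch uses, treats the columns as independent and caps the product at the table's row count, since a table cannot contain more distinct pairs than rows. The function name is hypothetical.

```python
def joint_ndv(ndv_a, ndv_b, row_count):
    """Joint NDV of two columns, assuming independence, capped by the row count."""
    return min(ndv_a * ndv_b, row_count)
```

With 1000 rows, columns with 10 and 20 distinct values give a joint NDV of 200, while columns with 100 and 200 distinct values are capped at 1000. Correlated columns will have a lower true joint NDV than this estimate, which is one reason the math deserves checking.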


