Posted to issues@drill.apache.org by "Jinfeng Ni (JIRA)" <ji...@apache.org> on 2016/01/07 20:21:39 UTC
[jira] [Commented] (DRILL-3838) Ability to use UDFs in the directory pruning process

    [ https://issues.apache.org/jira/browse/DRILL-3838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087911#comment-15087911 ] 

Jinfeng Ni commented on DRILL-3838:
-----------------------------------

I agree with using "sideband" join semantics to model the directory-based pruning process. The main question is when and where such join semantics should apply. To me, it makes sense that 1) it happens in the planner, and 2) the query planner executes a meta-data query while planning a regular query.

If we treat the directory names (dir0, dir1, ...), full path names, and file names as meta-data for file-system-based data sources, then the pruning itself can be modeled as querying that meta-data, just as [~julianhyde] described.

This is not a new problem; relational databases have shown how to solve it: by querying the catalog tables while planning a regular query.

For a regular query:
{code}
  select .... from T1, T2, ... WHERE ....
{code}

1) The planner builds and executes a number of catalog queries, including one like the following to find the indexes on T1:
{code}
  select indexname, other_attributes from sys.catalog.sysindexes where tablename = 'T1' and ...
{code}

2) The catalog query goes through run-time execution and returns its result to the planner. The planner then looks through the list of indexes available on T1 and builds the index scan plans (either a single-index scan or a multi-index plan). The planner then compares the table scan against the index plans and arrives at the final plan for the regular query, which is handed to the run-time for execution.
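
To make the two steps concrete, here is a minimal sketch in Python, using an in-memory SQLite database as a stand-in for the catalog. It is not how any particular database implements this: the `sysindexes` table layout, the stored `est_cost` column, and the `choose_access_path` helper are assumptions made up for illustration (a real planner would compute costs rather than read them from the catalog).

```python
import sqlite3

# Hypothetical catalog: one row per index, with a made-up precomputed cost.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sysindexes (tablename TEXT, indexname TEXT, est_cost REAL)"
)
conn.executemany(
    "INSERT INTO sysindexes VALUES (?, ?, ?)",
    [("T1", "t1_idx_a", 40.0), ("T1", "t1_idx_b", 75.0), ("T2", "t2_idx_a", 10.0)],
)


def choose_access_path(table, table_scan_cost):
    """Run the catalog query at planning time, then pick the cheapest plan."""
    rows = conn.execute(
        "SELECT indexname, est_cost FROM sysindexes WHERE tablename = ?",
        (table,),
    ).fetchall()
    # Candidate plans: the plain table scan plus one index scan per index found.
    candidates = [("table_scan", table_scan_cost)]
    candidates += [(f"index_scan({name})", cost) for name, cost in rows]
    return min(candidates, key=lambda c: c[1])


plan = choose_access_path("T1", table_scan_cost=100.0)
print(plan)
```

The point of the sketch is only the ordering: the catalog query is executed first, and its result feeds the plan comparison for the regular query.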

For directory-based partition pruning, a similar idea could be applied. The planner builds a meta-data query against the directory names, file names, or full paths, and executes it. The planner then uses the result of the meta-data query to refine the FileSelection for the original regular query. (The "refinement" step could be thought of as the "sideband" join?)
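
A rough sketch of that idea in Python (this is not Drill code): the file listing is treated as a small meta-data table with path, directory, and file name columns, a filter is evaluated against it, and the surviving paths become the refined selection. The names `FileMeta`, `list_metadata`, and `prune_selection`, and the example paths, are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class FileMeta:
    """One row of the hypothetical meta-data table for a file-system source."""
    path: str               # full path
    dirs: Tuple[str, ...]   # (dir0, dir1, ...) relative to the table root
    filename: str


def list_metadata(root: str, paths: List[str]) -> List[FileMeta]:
    """Build meta-data rows from full paths that are assumed to live under root."""
    rows = []
    for p in paths:
        parts = p[len(root):].strip("/").split("/")
        rows.append(FileMeta(path=p, dirs=tuple(parts[:-1]), filename=parts[-1]))
    return rows


def prune_selection(rows: List[FileMeta],
                    predicate: Callable[[FileMeta], bool]) -> List[str]:
    """The 'meta-data query': keep only the paths whose row passes the filter."""
    return [r.path for r in rows if predicate(r)]


paths = [
    "/logs/2015/12/a.parquet",
    "/logs/2016/01/b.parquet",
    "/logs/2016/01/c.parquet",
]
rows = list_metadata("/logs", paths)
# Roughly: WHERE dir0 = '2016' AND dir1 = '01'
selection = prune_selection(
    rows, lambda r: r.dirs[0] == "2016" and r.dirs[1] == "01"
)
print(selection)
```

Because the predicate sees the full path and file name as well as dirN, a filter (or a UDF) over any of those columns fits the same shape.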

Today, the meta-data evaluation is done using an interpreter, which is 1) single-threaded and 2) slow. If we could model the meta-data evaluation as a regular query against the catalog data, then the evaluation could 1) be multi-threaded and 2) use run-time execution, which is much faster than the interpreter. This part is missing in today's codebase.

I should stress the benefits of doing the meta-data evaluation at planning time instead of at execution time.

1) It essentially applies the filter at an earlier stage, which normally means better performance.

2) More importantly, it gives the planner more accurate statistics estimates (rowCount, etc.). Such statistics play a critical role in deciding the join order, the physical distributed join method, and so on, which I think is critical for Drill to get good performance on multi-table queries.
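
As an illustration of point 2, the sketch below (Python, not Drill code) shows how a row-count estimate taken before and after pruning can flip the choice of join method. The threshold value, the per-file row counts, and the helper names `estimate_rows` and `choose_join` are all assumptions invented for this example; Drill's actual cost model is more involved.

```python
# Hypothetical limit: broadcast the smaller side only if it is at most this big.
BROADCAST_THRESHOLD = 1_000_000


def estimate_rows(file_row_counts, selection):
    """Estimated rowCount of a scan = sum of row counts of the selected files."""
    return sum(file_row_counts[p] for p in selection)


def choose_join(left_rows, right_rows):
    """Pick a distributed join method from the estimated input sizes."""
    smaller = min(left_rows, right_rows)
    return "broadcast" if smaller <= BROADCAST_THRESHOLD else "hash_distributed"


# Made-up per-file row counts for a directory-partitioned table.
file_row_counts = {
    "/logs/2015/12/a.parquet": 5_000_000,
    "/logs/2016/01/b.parquet": 300_000,
    "/logs/2016/01/c.parquet": 200_000,
}
all_files = list(file_row_counts)
pruned_files = ["/logs/2016/01/b.parquet", "/logs/2016/01/c.parquet"]
other_table_rows = 50_000_000

# Without planning-time pruning the planner sees the full table size.
unpruned_choice = choose_join(
    estimate_rows(file_row_counts, all_files), other_table_rows
)
# With planning-time pruning the estimate reflects only the surviving files.
pruned_choice = choose_join(
    estimate_rows(file_row_counts, pruned_files), other_table_rows
)
print(unpruned_choice, pruned_choice)
```

If pruning only happened at execution time, the planner would have to commit to a join method using the unpruned estimate.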



> Ability to use UDFs in the directory pruning process
> ----------------------------------------------------
>
>                 Key: DRILL-3838
>                 URL: https://issues.apache.org/jira/browse/DRILL-3838
>             Project: Apache Drill
>          Issue Type: New Feature
>          Components: Query Planning & Optimization
>    Affects Versions: 1.2.0
>            Reporter: Stefán Baxter
>
> This feature request is about allowing UDFs to participate in the Directory/Partition pruning process at runtime rather than at planning/optimization time.
> For this a UDF needs:
>  - filename
>  - full path (not just dirN)
>  - to be able to throw an IgnoreFile exception
>  - to be able to throw an IgnoreDirectory exception
> I think the naming is pretty self explanatory and hopefully this brief description is enough.
> _Stefan 


