Posted to issues@drill.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/11/05 06:43:00 UTC

[jira] [Commented] (DRILL-6791) Merge scan projection framework into master

    [ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16674733#comment-16674733 ] 

ASF GitHub Bot commented on DRILL-6791:
---------------------------------------

paul-rogers commented on a change in pull request #1501: DRILL-6791: Scan projection framework
URL: https://github.com/apache/drill/pull/1501#discussion_r230642040
 
 

 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/ScanLevelProjection.java
 ##########
 @@ -0,0 +1,349 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.scan.project;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.physical.rowSet.project.RequestedTuple;
+import org.apache.drill.exec.physical.rowSet.project.RequestedTuple.RequestedColumn;
+import org.apache.drill.exec.physical.rowSet.project.RequestedTupleImpl;
+
+/**
+ * Parses and analyzes the projection list passed to the scanner. The
+ * projection list is per scan, independent of any tables that the
+ * scanner might scan. The projection list is then used as input to the
+ * per-table projection planning.
+ * <p>
+ * Accepts the inputs needed to
+ * plan a projection, builds the mappings, and constructs the projection
+ * mapping object.
+ * <p>
+ * Builds the per-scan projection plan given a set of projected columns.
+ * Determines the output schema, which columns to project from the data
+ * source, which are metadata, and so on.
+ * <p>
+ * An annoying aspect of SQL is that the projection list (the list of
+ * columns to appear in the output) is specified after the SELECT keyword.
+ * In Relational theory, projection is about columns, selection is about
+ * rows...
+ * <p>
+ * Mappings can be based on four primary use cases:
+ * <ul>
+ * <li><tt>SELECT *</tt>: Project all data source columns, whatever they happen
+ * to be. Create columns using names from the data source. The data source
+ * also determines the order of columns within the row.</li>
+ * <li><tt>SELECT columns</tt>: Similar to SELECT * in that it projects all columns
+ * from the data source, in data source order. But, rather than creating
+ * individual output columns for each data source column, creates a single
+ * column which is an array of Varchars which holds the (text form) of
+ * each column as an array element.</li>
+ * <li><tt>SELECT a, b, c, ...</tt>: Project a specific set of columns, identified by
+ * case-insensitive name. The output row uses the names from the SELECT list,
+ * but types from the data source. Columns appear in the row in the order
+ * specified by the SELECT.</li>
+ * <li><tt>SELECT ...</tt>: SELECT nothing, occurs in <tt>SELECT COUNT(*)</tt>
+ * type queries. The provided projection list contains no (table) columns, though
+ * it may contain metadata columns.</li>
+ * </ul>
+ * Names in the SELECT list can reference any of five distinct types of output
+ * columns:
+ * <ul>
+ * <li>Wildcard ("*") column: indicates the place in the projection list to insert
+ * the table columns once found in the table projection plan.</li>
+ * <li>Data source columns: columns from the underlying table. The table
+ * projection planner will determine if the column exists, or must be filled
+ * in with a null column.</li>
+ * <li>The generic data source columns array: <tt>columns</tt>, or optionally
+ * specific members of the <tt>columns</tt> array such as <tt>columns[1]</tt>.</li>
+ * <li>Implicit columns: <tt>fqn</tt>, <tt>filename</tt>, <tt>filepath</tt>
+ * and <tt>suffix</tt>. These reference
+ * parts of the name of the file being scanned.</li>
+ * <li>Partition columns: <tt>dir0</tt>, <tt>dir1</tt>, ...: These reference
+ * parts of the path name of the file.</li>
+ * </ul>
+ *
+ * @see {@link ImplicitColumnExplorer}, the class from which this class
+ * evolved
+ */
+
+public class ScanLevelProjection {
+
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ScanLevelProjection.class);
+
+  /**
+   * Interface for add-on parsers, avoids the need to create
+   * a single, tightly-coupled parser for all types of columns.
+   * The main parser handles wildcards and assumes the rest of
+   * the columns are table columns. The add-on parser can tag
+   * columns as special, such as to hold metadata.
+   */
+
+  public interface ScanProjectionParser {
+    void bind(ScanLevelProjection builder);
+    boolean parse(RequestedColumn inCol);
+    void validate();
+    void validateColumn(ColumnProjection col);
+    void build();
+  }
+
+  // Input
+
+  protected final List<SchemaPath> projectionList;
+
+  // Configuration
+
+  protected List<ScanProjectionParser> parsers;
+  private final boolean v1_12MetadataLocation;
+
+  // Internal state
+
+  protected boolean sawWildcard;
+
+  // Output
+
+  protected List<ColumnProjection> outputCols = new ArrayList<>();
+  protected RequestedTuple tableLoaderProjection;
+  protected boolean hasWildcard;
+  protected boolean emptyProjection = true;
+
+  /**
+   * Specify the set of columns in the SELECT list. Since the column list
+   * comes from the query planner, assumes that the planner has checked
+   * the list for syntax and uniqueness.
+   *
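+   * @param projectionList list of columns in the SELECT list in SELECT list order
+   * @param parsers add-on parsers that can tag columns as special
+   * @param v1_12MetadataLocation true to request the Drill 1.12 position
+   * for metadata columns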
+   */
+  public ScanLevelProjection(List<SchemaPath> projectionList,
+      List<ScanProjectionParser> parsers,
+      boolean v1_12MetadataLocation) {
+    this.projectionList = projectionList;
+    this.parsers = parsers;
+    this.v1_12MetadataLocation = v1_12MetadataLocation;
+    doParse();
+  }
+
+  private void doParse() {
+    tableLoaderProjection = RequestedTupleImpl.parse(projectionList);
+
+    for (ScanProjectionParser parser : parsers) {
+      parser.bind(this);
+    }
+    for (RequestedColumn inCol : tableLoaderProjection.projections()) {
+      if (inCol.isWildcard()) {
+        mapWildcard(inCol);
+      } else {
+        mapColumn(inCol);
+      }
+    }
+    verify();
+    for (ScanProjectionParser parser : parsers) {
+      parser.build();
+    }
+  }
+
+  public ScanLevelProjection(List<SchemaPath> projectionList,
+      List<ScanProjectionParser> parsers) {
+    this(projectionList, parsers, false);
+  }
+
+  /**
+   * Wildcard is special: add it, then let parsers add any custom
+   * columns that are needed. The order is important: we want custom
+   * columns to follow table columns.
+   */
+
+  private void mapWildcard(RequestedColumn inCol) {
+
+    // Wildcard column: this is a SELECT * query.
+
+    if (sawWildcard) {
+      throw new IllegalArgumentException("Duplicate * entry in project list");
+    }
+
+    // Remember the wildcard position, if we need to insert it.
+    // Ensures that the main wildcard expansion occurs before add-on
+    // columns.
+
+    int wildcardPosn = outputCols.size();
+
+    // Parsers can consume the wildcard. But, all parsers must
+    // have visibility to the wildcard column.
+
+    for (ScanProjectionParser parser : parsers) {
+      if (parser.parse(inCol)) {
+        wildcardPosn = -1;
+      }
+    }
+
+    // Set this flag only after the parser checks.
+
+    sawWildcard = true;
+
+    // If not consumed, put the wildcard column into the projection list as a
+    // placeholder to be filled in later with actual table columns.
+
+    if (wildcardPosn != -1) {
+
+      // Drill 1.1 - 1.11 and Drill 1.13 or later put metadata columns after
+      // data columns. Drill 1.12 moved them before data columns. For testing
+      // and compatibility, the client can request to use the Drill 1.12 position,
+      // though the after-data position is the default.
+      //
+      // Note that the after-data location is much more convenient for the dirx
+      // partition columns since these vary in number across scans within the same query.
+      // By putting them at the end, the index of all other columns remains
+      // constant. Drill 1.12 broke that behavior, but Drill 1.13 restored it.
+      //
+      // This option can be removed in Drill 1.14 after things settle down.
 
 Review comment:
   Let's leave it for now. Once the revised readers are integrated, the test case that triggered this hack will tell us whether Drill still behaves as in 1.12 or whether the behavior of 1.11 and earlier has been restored.
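
   The ordering concern described in the code comment above can be illustrated with a minimal sketch. This is not Drill code: the class `MetadataPositionSketch` and the method `outputOrder` are hypothetical names made up for illustration, and the flag only mimics the two positions discussed (partition columns after data columns, or before them as in Drill 1.12).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not part of Drill. */
final class MetadataPositionSketch {

  /**
   * Builds an output column order from data columns and partition columns.
   * Passing metadataFirst = true mimics the Drill 1.12 position (assumption
   * for illustration); false places partition columns after data columns.
   */
  static List<String> outputOrder(List<String> dataCols,
      List<String> partitionCols, boolean metadataFirst) {
    List<String> out = new ArrayList<>();
    if (metadataFirst) {
      out.addAll(partitionCols);
      out.addAll(dataCols);
    } else {
      out.addAll(dataCols);
      out.addAll(partitionCols);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> data = Arrays.asList("a", "b");
    // Two scans in the same query may see different directory depths.
    List<String> shallow = Arrays.asList("dir0");
    List<String> deep = Arrays.asList("dir0", "dir1");

    // After-data position: "a" and "b" keep indexes 0 and 1 in both scans.
    System.out.println(outputOrder(data, shallow, false)); // [a, b, dir0]
    System.out.println(outputOrder(data, deep, false));    // [a, b, dir0, dir1]

    // Before-data position: the index of "a" shifts with the number of dirs.
    System.out.println(outputOrder(data, shallow, true));  // [dir0, a, b]
    System.out.println(outputOrder(data, deep, true));     // [dir0, dir1, a, b]
  }
}
```

   With the after-data position, the data column indexes do not depend on how many `dirN` columns a given scan produces, which is the convenience the code comment refers to.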

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Merge scan projection framework into master
> -------------------------------------------
>
>                 Key: DRILL-6791
>                 URL: https://issues.apache.org/jira/browse/DRILL-6791
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.15.0
>
>
> Merge the next set of "result set loader" code into master via a PR. This one covers the "schema projection" mechanism which:
> * Handles none (SELECT COUNT(*)), some (SELECT a, b, x) and all (SELECT *) projection.
> * Handles null columns (when projecting a column "x" that does not exist in the base table).
> * Handles constant columns as used for file metadata (AKA "implicit" columns).
> * Handles schema persistence: the need to reuse the same vectors across different scanners.
> * Provides a framework for consuming externally-supplied metadata.
> * Since we don't yet have a way to provide "real" metadata, obtains metadata hints from previous batches and from the projection list (a.b implies that "a" is a map, c[0] implies that "c" is an array, etc.)
> * Handles merging the set of data source columns and null columns to create the final output batch.
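
The "metadata hints from the projection list" item above (a.b implies that "a" is a map, c[0] implies that "c" is an array) can be sketched roughly as follows. This is a simplified illustration based on plain string inspection, not the parsing Drill performs; `ProjectionHintSketch`, `Hint`, and `hintFor` are hypothetical names.

```java
/** Hypothetical illustration only; not part of Drill. */
final class ProjectionHintSketch {

  /** Structure implied for the root column of a projected path. */
  enum Hint { MAP, ARRAY, UNKNOWN }

  /**
   * Infers a hint for the root column from the first qualifier in the path:
   * a leading ".member" suggests a map, a leading "[index]" suggests an array.
   * A bare name carries no structural hint.
   */
  static Hint hintFor(String path) {
    int dot = path.indexOf('.');
    int bracket = path.indexOf('[');
    if (dot < 0 && bracket < 0) {
      return Hint.UNKNOWN;
    }
    if (bracket < 0 || (dot >= 0 && dot < bracket)) {
      return Hint.MAP;
    }
    return Hint.ARRAY;
  }

  public static void main(String[] args) {
    System.out.println(hintFor("a.b"));  // MAP
    System.out.println(hintFor("c[0]")); // ARRAY
    System.out.println(hintFor("x"));    // UNKNOWN
  }
}
```

Only the first qualifier is examined here, so the hint describes the root column alone; nested structure would need a real path parser.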



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)