Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/06/09 07:42:00 UTC
[jira] [Commented] (IMPALA-8011) Allow filtering on virtual column for file name

    [ https://issues.apache.org/jira/browse/IMPALA-8011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552019#comment-17552019 ] 

ASF subversion and git services commented on IMPALA-8011:
---------------------------------------------------------

Commit 23d09638de35dcec6419a5e30df08fd5d8b27e7d in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=23d09638d ]

IMPALA-801, IMPALA-8011: Add INPUT__FILE__NAME virtual column for file name

Hive has a virtual column, INPUT__FILE__NAME, which returns the name of
the data file that stores the row. It can be used in several ways; see
the two Jira tickets above for examples. This virtual column is also
needed to support position-based delete files in Iceberg V2 tables.

This patch also adds the foundations to support further table-level
virtual columns later. Virtual columns are stored at the table level
in a separate list from the table schema. During path resolution
in Path.resolve() we also try to resolve virtual columns. Slot
descriptors also record whether they refer to a virtual column.
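
The resolution scheme described above can be sketched roughly as follows.
This is an illustrative Python model, not Impala's frontend code: the
class names, the resolve() helper, and the "schema columns first, then
virtual columns" lookup order are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    name: str
    is_virtual: bool = False


@dataclass
class Table:
    # Regular schema columns and virtual columns are kept in separate lists.
    columns: List[Column]
    virtual_columns: List[Column] = field(default_factory=list)


@dataclass(frozen=True)
class SlotDescriptor:
    column: Column

    @property
    def is_virtual(self) -> bool:
        # The slot remembers whether it refers to a virtual column.
        return self.column.is_virtual


def resolve(table: Table, name: str) -> Optional[SlotDescriptor]:
    """Resolve a column name case-insensitively.

    Assumed order: schema columns are tried first, then virtual columns.
    """
    wanted = name.lower()
    for col in table.columns + table.virtual_columns:
        if col.name.lower() == wanted:
            return SlotDescriptor(col)
    return None


# Hypothetical table with two schema columns and one virtual column.
table = Table(
    columns=[Column("id"), Column("name")],
    virtual_columns=[Column("INPUT__FILE__NAME", is_virtual=True)],
)
slot = resolve(table, "input__file__name")
```

Keeping the virtual columns out of the schema list means code that walks
the schema does not see them unless it asks for them explicitly.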

Currently we only add the INPUT__FILE__NAME virtual column. The value
of this column can be set in the template tuple of the scanners.

All kinds of operations are possible on this virtual column: users
can invoke functions on it, filter rows by it, group by it, etc.

Special care is needed for virtual columns when column masking/row
filtering applies to them. They are added as "hidden" select list
items to the table masking views, which means they are not expanded
by * expressions. They still need to be included in * expressions,
though, when they come from user-written views.
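
The difference between hidden and regular select list items can be
modelled roughly as below. This is a sketch only: SelectItem,
expand_star() and resolvable_names() are hypothetical names, and the
model assumes that a hidden item stays resolvable by name while being
skipped by * expansion.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SelectItem:
    name: str
    hidden: bool = False


def expand_star(items: List[SelectItem]) -> List[str]:
    """Names produced by a * expression: hidden items are skipped."""
    return [item.name for item in items if not item.hidden]


def resolvable_names(items: List[SelectItem]) -> List[str]:
    """Names that can still be referenced explicitly, hidden or not."""
    return [item.name for item in items]


# Table masking view: the virtual column is added as a hidden item.
masking_view = [
    SelectItem("id"),
    SelectItem("name"),
    SelectItem("INPUT__FILE__NAME", hidden=True),
]

# User-written view that selects the virtual column explicitly: here it
# is an ordinary select list item, so * must include it.
user_view = [
    SelectItem("id"),
    SelectItem("INPUT__FILE__NAME"),
]

masked_star = expand_star(masking_view)
user_star = expand_star(user_view)
```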

Testing:
 * analyzer tests
 * added e2e tests

Change-Id: I498591f1db08a91a5c846df59086d2291df4ff61
Reviewed-on: http://gerrit.cloudera.org:8080/18514
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Allow filtering on virtual column for file name
> -----------------------------------------------
>
>                 Key: IMPALA-8011
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8011
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>            Reporter: Peter Ebert
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: built-in-function
>
> An additional performance enhancement would be the capability to filter on file names using a virtual column.  This would be somewhat like the current optimization of sorting data and skipping files based on Parquet metadata, but instead you put something in the file name to indicate whether its contents should be filtered.
> For example, say you were writing first names and then searching for them. During the writing phase you put the first letter of each first name into the file name, so if I'm storing Alice, Bob, and Cathy, my file name is "ABC". Then, when searching for David, you could filter on whether INPUT__FILE__NAME contains "D" and skip reading the file.
> Another use would be if you had a daily partition and put the timestamp into the file name; you could then limit the search to only the last hour even though the partition is daily. This leaves you free to sort by another column, making searches even faster on both.
>  
> This requires IMPALA-801
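
The file-name scheme requested in the issue description above can be
illustrated with a small sketch. It models only the idea of pruning
files by name before reading them; it does not describe what Impala's
scanners do. The paths, the ".parq" extension, and the helper names are
made up for the example.

```python
from pathlib import PurePosixPath
from typing import Callable, Iterable, List


def prune_files(paths: Iterable[str], keep: Callable[[str], bool]) -> List[str]:
    """Return only the paths whose file name satisfies the predicate.

    Files that fail the predicate would never need to be opened.
    """
    return [p for p in paths if keep(PurePosixPath(p).name)]


def may_contain_first_name(first_name: str) -> Callable[[str], bool]:
    """Predicate for the hypothetical 'initials in the file name' scheme."""
    initial = first_name[0].upper()
    # Only the part before the extension carries the initials.
    return lambda file_name: initial in PurePosixPath(file_name).stem


# Hypothetical layout: one daily partition, initials encoded in file names.
files = [
    "/warehouse/names/day=2022-06-09/ABC.parq",
    "/warehouse/names/day=2022-06-09/DEF.parq",
]
to_scan = prune_files(files, may_contain_first_name("David"))
```

In this model, a search for David only needs to read DEF.parq; ABC.parq
is ruled out by its name alone.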


