Posted to issues-all@impala.apache.org by "Shant Hovsepian (Jira)" <ji...@apache.org> on 2020/11/15 20:00:01 UTC
[jira] [Created] (IMPALA-10327) SymlinkTextInputFormat for reading
manifest file based tables.
Shant Hovsepian created IMPALA-10327:
----------------------------------------
Summary: SymlinkTextInputFormat for reading manifest file based tables.
Key: IMPALA-10327
URL: https://issues.apache.org/jira/browse/IMPALA-10327
Project: IMPALA
Issue Type: New Feature
Components: Catalog, Frontend
Reporter: Shant Hovsepian
The [SymlinkTextInputFormat|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java] was an early Hadoop/Hive feature that has recently started to see a lot of use. Originally it was used to support symlinks in Hive warehouse directories, but now it is more commonly used to specify the files that make up a Hive table without requiring a directory listing operation.
Instead of pointing to a directory of files or partitions, the Hive table metadata refers to a single directory containing "manifest files". These files have a well-defined format that specifies the files that constitute the table.
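As a rough illustration of the lookup described above, here is a minimal Python sketch of how a reader could turn a manifest directory into the list of data files. It makes several assumptions that are not part of this issue: each manifest file is plain text with one data file path per line, files whose names start with "_" or "." are treated as hidden and skipped, and the local filesystem stands in for HDFS or S3. The function name `resolve_manifest_dir` is hypothetical and is not an Impala or Hive API.

```python
from pathlib import Path
from typing import List


def resolve_manifest_dir(manifest_dir: Path) -> List[str]:
    """Return the data file paths named by every manifest file in manifest_dir.

    Assumption: one path per line, blank lines ignored, hidden files skipped.
    """
    data_files: List[str] = []
    # Only the (small) manifest directory is listed; the data directories
    # themselves are never listed. Sorting keeps the result deterministic.
    for manifest in sorted(manifest_dir.iterdir()):
        if not manifest.is_file() or manifest.name.startswith(("_", ".")):
            continue
        for line in manifest.read_text(encoding="utf-8").splitlines():
            path = line.strip()
            if path:  # ignore blank lines
                data_files.append(path)
    return data_files


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        manifest_dir = Path(tmp)
        # Hypothetical manifest contents; the bucket and file names are made up.
        (manifest_dir / "manifest").write_text(
            "s3://example-bucket/t/part-0.parquet\n"
            "s3://example-bucket/t/part-1.parquet\n",
            encoding="utf-8",
        )
        for data_file in resolve_manifest_dir(manifest_dir):
            print(data_file)
```

The point of the sketch is that the set of table files comes from reading the manifests, so the cost scales with the number of manifest files rather than with the size of the data directories.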
This mechanism is used in the following cases:
* Delta Lake uses manifests to generate consistent read-only views of its table format for use by Presto, Hive, and Redshift Spectrum: [https://docs.delta.io/0.7.0/integrations.html]
* AWS Redshift can UNLOAD Redshift tables and partitions to corresponding Parquet files on S3 for consumption by other tools: [https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html]
* AWS S3 Inventories: [https://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inventory.html]
Using this functionality with HDFS and S3, even without the need to interoperate with the above, would provide performance benefits by avoiding expensive directory listing operations.