Posted to issues@drill.apache.org by "Arina Ielchiieva (Jira)" <ji...@apache.org> on 2020/02/04 12:49:00 UTC

[jira] [Commented] (DRILL-7567) Metastore enhancements

    [ https://issues.apache.org/jira/browse/DRILL-7567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029821#comment-17029821 ] 

Arina Ielchiieva commented on DRILL-7567:
-----------------------------------------

{quote}
The Iceberg metastore requires atomic rename. But, the most common use case for Drill today is the cloud. S3 does not support atomic rename. We need to fix this.
{quote}
It's not a bug, it's a limitation of Iceberg, and there is nothing we can do about it. Users might need another Metastore implementation for S3. For example, I plan to deliver an RDBMS Metastore implementation, though that would require a DB setup.
Everything has its own limitations, you know.
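
To make the atomic-rename requirement concrete, here is a minimal local-filesystem sketch of a commit step that relies on an atomic "create only if absent" operation. It is an analogy, not Iceberg's or Drill's actual code: the function name {{commit_version}}, the file naming and the use of a hard link are all assumptions chosen for illustration. On an object store such as S3, a rename is a copy followed by a delete, so there is no equivalent single atomic step and two writers could both believe they committed the same version.

```python
import os
import tempfile


def commit_version(metadata_dir: str, version: int, payload: bytes) -> bool:
    """Try to publish metadata version `version`.

    Returns True if this writer published it, False if another writer
    had already published the same version. Hypothetical helper, for
    illustration only.
    """
    target = os.path.join(metadata_dir, f"v{version}.metadata.json")

    # Write the full payload to a private temp file first, so readers
    # never observe a half-written metadata file.
    fd, tmp_path = tempfile.mkstemp(dir=metadata_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
        try:
            # os.link creates the target name in one atomic step and
            # fails if the name already exists, so only one writer can win.
            os.link(tmp_path, target)
            return True
        except FileExistsError:
            return False
    finally:
        # The temp name is no longer needed whether or not we won.
        os.unlink(tmp_path)
```

If two writers race to publish the same version, exactly one call returns True and the loser has to re-read the latest metadata and retry with the next version number. This assumes a filesystem that supports hard links; it is that kind of guarantee that a plain S3 bucket does not offer.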
{quote}
The documentation says we use the "plugin name" as part of the table key. But, for DFS, say, the user can have dozens of plugin configs, each with a distinct name. Each can reuse the same workspace name of, say "foo". Thus "dfs/foo" is ambiguous. But, "hdfs1/foo", and "local/foo" are unique if we use storage plugin config names.
{quote}
We do use config names; dfs is simply the config name commonly used in Drill examples. I think Vova will update the docs to resolve this ambiguity.
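
As a small illustration of why keying on the config name removes the ambiguity, the sketch below models a table key as (config name, workspace, table name). The class {{TableKey}} and its field names are hypothetical and are not taken from the Metastore code; the point is only that two configs of the same plugin type stay distinct even when they share a workspace name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableKey:
    """Hypothetical table key, for illustration only."""

    storage_plugin: str  # the storage plugin *config* name, e.g. "hdfs1", not the plugin type
    workspace: str
    table_name: str

    def path(self) -> str:
        # Render the key in the "config/workspace/table" form used in the discussion.
        return f"{self.storage_plugin}/{self.workspace}/{self.table_name}"


# Two file-system configs that both define a workspace called "foo".
hdfs_key = TableKey("hdfs1", "foo", "orders")
local_key = TableKey("local", "foo", "orders")
```

Because the config name is part of the key, {{hdfs_key}} and {{local_key}} compare as different and can coexist in the same store.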
{quote}
It is not clear if the Iceberg metastore supports HDFS security and Kerberos tickets. If not, then it won't work in a production deployment.
{quote}
The user who is running the Drillbit should have access to the directory where Iceberg metadata is stored. As far as I understand, Iceberg is not intended to support security.
{quote}
The metastore is meant to store schema. A key use is when schema is ambiguous. But, metastore gathers schema the same way that Drill queries tables. If schema is ambiguous, the ANALYZE TABLE will fail. Thus we do not actually solve the ambiguous schema problem. We need a solution.
{quote}
I think one option would be to let the user provide or correct the schema themselves if Drill cannot determine it correctly. Similar to schema provisioning, but for the Metastore. Vova, please confirm.
{quote}
Review the internal Metastore tables. See many comments about the structure in the Metastore documentation PR.
{quote}
I am confused here: the Metastore does not have any internal tables at all. If you are referring to the INFORMATION_SCHEMA tables, they do not store Metastore metadata; they just query the Metastore and expose its data to users. Note that they do not expose all of the data, just the parts that might be useful.

 

> Metastore enhancements
> ----------------------
>
>                 Key: DRILL-7567
>                 URL: https://issues.apache.org/jira/browse/DRILL-7567
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Paul Rogers
>            Priority: Major
>
> The Metastore feature shipped as a Beta. Review of the documentation identified a number of opportunities for improvement before the feature leaves Beta.
> * Should the Metastore be configured in its own file? Does this push us in the direction of each feature having its own set of config files? Or, should config move into the normal Drill config files?
> * Provide a detailed schema and description of Metadata entities, like the Hive metadata schema.
> * Provide an out-of-the-box sample Metastore for some of Drill's demo tables.
> * Provide a Metastore tutorial. Refer to the sample Metastore in the tutorial. Many folks learn best by trying things hands-on.
> * Solve read/write consistency issues to avoid the need for the error/recovery described for {{metastore.metadata.fallback_to_file_metadata}}.
> * Boot-time config is stored in the {{drill.metastore}} namespace. But, Metastore SYSTEM/SESSION options are in the {{drill.exec}} namespace. This is confusing. Let's be consistent.
> * {{drill.exec.storage.implicit.last_modified_time.column.label}} is a bug: Drill internal names should never conflict with user-defined column names. Figure out where they conflict and fix the issue. No user can ever guarantee that some name will never be used in their tables. Nor can users easily fix the issue if it occurs. (Note: this is a flaw with our implicit columns as well.)
> * Provide a form of ANALYZE TABLE that automatically reuses settings from any previous run. It will otherwise be very user unfriendly for the user to have to find a place to store the ANALYZE TABLE command so that they can submit exactly the same one each time. In fact, experience with Impala suggests that end users will have no idea about schema, they just want the latest metadata. Such users won't even know the details of a command some other user might have submitted.
> * The Iceberg metastore requires atomic rename. But, the most common use case for Drill today is the cloud. S3 does not support atomic rename. We need to fix this.
> * The documentation says we use the "plugin name" as part of the table key. But, for DFS, say, the user can have dozens of plugin configs, each with a distinct name. Each can reuse the same workspace name of, say "foo". Thus "dfs/foo" is ambiguous. But, "hdfs1/foo", and "local/foo" are unique if we use storage plugin config names.
> * It is not clear if the Iceberg metastore supports HDFS security and Kerberos tickets. If not, then it won't work in a production deployment.
> * The metastore is meant to store schema. A key use is when schema is ambiguous. But, metastore gathers schema the same way that Drill queries tables. If schema is ambiguous, the ANALYZE TABLE will fail. Thus we do not actually solve the ambiguous schema problem. We need a solution.
> * Better partition support. Drill has a long-standing usability issue that users must do their own partition coding. If I want data from 2018-11 to 2019-02 (one quarter worth of data), I have to write the very ugly
> {code:sql}
> WHERE (dir0 = 2018 AND dir1 >= 11)
>         OR (dir0 = 2019 AND dir1 <= 1)
> {code}
> With Hive/Impala/Presto I can just write:
> {code:sql}
> WHERE transDate BETWEEN '2018-11-01' AND '2019-01-31'
> {code}
> * Allow staged gathering of stats. Allow me to first gather stats and review them for quality before I have my users start using them. As it is, there is no ability to gather them, enable the option for a session for testing, verify that things work right, then turn it on for everyone. That is, in a shared system, all heck can break loose in the current implementation.
> * Review the internal Metastore tables. See many comments about the structure in the Metastore documentation PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)