Posted to dev@impala.apache.org by Quanlong Huang <hu...@126.com> on 2017/09/12 03:51:37 UTC

Load metadata exactly in need

Hi all,


Currently, if a "describe" statement hits an incomplete table, the impalad sends an RPC request to the catalogd to load the metadata of this table. This takes a long time for tables with many partitions and many files. However, to serve the "describe" statement, we only need the metadata from the Hive MetaStore. In my experiments (with load_catalog_in_background=false), it takes hours to describe a large table. This statement is pretty cheap in Hive or Presto, so users may worry about whether Impala is set up correctly.


Can we add a more fine-grained strategy for loading the metadata? Likewise, for queries that hit only one partition of a huge table, we don't need to load all the file descriptors. For example, there could be more levels to trigger metadata loading:
Level 1. Load metadata from Hive MetaStore
Level 2. Load file descriptors of given partitions
Level 3. Load all file descriptors


Then we can serve the following scenarios better:
1. Describe a large table.
2. Run a query on one or several partitions of this table (each partition has few files).
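
As a rough illustration of the levels and scenarios above, here is a minimal Python sketch. It is only a toy model: LoadLevel, LazyTable, ensure_loaded and the in-memory "storage" dict are hypothetical names invented for this example and are not Impala's catalog API. The point is that "describe" only needs Level 1, and a query on a few partitions only needs Level 2 for those partitions.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class LoadLevel(IntEnum):
    """Hypothetical load levels, mirroring Level 1-3 in the proposal."""
    HMS_ONLY = 1          # schema and partition list from the Hive MetaStore
    SOME_PARTITIONS = 2   # file descriptors for the requested partitions only
    ALL_PARTITIONS = 3    # file descriptors for every partition


@dataclass
class LazyTable:
    """Toy stand-in for a catalog table (assumption, not Impala's Table class)."""
    name: str
    storage: dict   # simulated file system: partition name -> list of files
    schema: list    # simulated Hive MetaStore schema
    hms_loaded: bool = False
    file_descriptors: dict = field(default_factory=dict)
    listing_calls: int = 0  # number of simulated per-partition file listings

    def ensure_loaded(self, level, partitions=()):
        # Level 1 is a prerequisite for every other level.
        self.hms_loaded = True
        if level == LoadLevel.HMS_ONLY:
            return
        if level == LoadLevel.ALL_PARTITIONS:
            wanted = list(self.storage)
        else:
            wanted = list(partitions)
        for part in wanted:
            # Only list files for partitions that are not cached yet.
            if part not in self.file_descriptors:
                self.file_descriptors[part] = list(self.storage[part])
                self.listing_calls += 1

    def describe(self):
        # Scenario 1: "describe" needs only metastore-level metadata.
        self.ensure_loaded(LoadLevel.HMS_ONLY)
        return self.schema

    def scan(self, partitions):
        # Scenario 2: a query touching a few partitions loads only those.
        self.ensure_loaded(LoadLevel.SOME_PARTITIONS, partitions)
        return [f for p in partitions for f in self.file_descriptors[p]]


table = LazyTable(
    name="events",
    storage={f"day={d}": [f"day={d}/part-0.parq"] for d in range(1000)},
    schema=[("id", "bigint"), ("day", "int")],
)
table.describe()
calls_after_describe = table.listing_calls  # describe triggered no file listing
files = table.scan(["day=7"])
calls_after_scan = table.listing_calls      # only the queried partition was listed
```

In this sketch the cost of "describe" does not depend on the number of partitions or files, and the cost of a query grows only with the partitions it touches. A full load (Level 3) remains available for statements that really need every file descriptor.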


Has this been discussed before?


Thanks
Quanlong

Re: Re: Load metadata exactly in need

Posted by Dimitris Tsirogiannis <dt...@cloudera.com>.
Thanks for the feedback Quanlong. We plan on addressing many of these
catalog issues in the immediate future.

Dimitris

On Mon, Sep 11, 2017 at 10:21 PM, Quanlong Huang <hu...@126.com>
wrote:

> Hi Dimitris,
>
> Thanks for your quick reply!
>
> IMPALA-3127 is a great ticket. But it still has no progress and no
> assignee. Is it tracked in your internal Jira?
>
> Hope this can be done soon, since some users may choose Presto instead of
> Impala due to these usability issues.
>
> Thanks
> Quanlong

Re: Re: Load metadata exactly in need

Posted by Quanlong Huang <hu...@126.com>.
Hi Dimitris,


Thanks for your quick reply!


IMPALA-3127 is a great ticket. But it still has no progress and no assignee. Is it tracked in your internal Jira?


Hope this can be done soon, since some users may choose Presto instead of Impala due to these usability issues.


Thanks
Quanlong

At 2017-09-12 12:17:23, "Dimitris Tsirogiannis" <dt...@cloudera.com> wrote:
>Hi Quanlong,
>
>You're right. The catalog needs to handle metadata at a finer granularity.
>We are actively looking into the options you mentioned as well as other
>related changes (see IMPALA-3234 and IMPALA-3127) to improve the
>performance and scalability of metadata management.
>
>Thanks
>Dimitris

Re: Load metadata exactly in need

Posted by Dimitris Tsirogiannis <dt...@cloudera.com>.
Hi Quanlong,

You're right. The catalog needs to handle metadata at a finer granularity.
We are actively looking into the options you mentioned as well as other
related changes (see IMPALA-3234 and IMPALA-3127) to improve the
performance and scalability of metadata management.

Thanks
Dimitris
