You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@impala.apache.org by Antoni Ivanov <an...@gmail.com> on 2018/03/22 19:04:50 UTC

Large number of partitions warning

Hi, 

I was reading https://www.cloudera.com/documentation/enterprise/5-11-x/topics/impala_partitioning.html#partition_stats 
And noticed this warning "If this metadata for all tables combined exceeds 2 GB, you might experience service downtime.”

But I thought that this limit applies for single table. Having large number of partitions per table can be an issue because each metadata operation for this table would require the whole metadata be updated (as far as I understand Impala doesn’t update partially the metadata for only the changed partitions) Also because of Java serialization limitation where you cannot serialize it to more than 1G or 2G 
I am guessing this is related to https://issues.apache.org/jira/browse/IMPALA-5058 - but that’s only if you do frequent DDL operations I suppose. 

Am I understanding things so far correctly. So why can there be service downtime in this case ?

Thanks,
Antoni

Re: Large number of partitions warning

Posted by Antoni Ivanov <ai...@vmware.com>.

Thank for the response. 

Does that mean before v.2.12 if one uses load_catalog_in_background option to load the metadata (on demand) incrementally in background is unlikely to hit an an issue. 

Also I am guessing it has an impact on DDL performance because the catalog updates (say DDL statement or adding partitions) would take a bit more time with bigger metadata especially with IMPALA-5058 because there would be increased lock contention (for the getCatalogObjects) even if each one take only small amout of time slower ?



On 3/22/18, 9:14 PM, "Dimitris Tsirogiannis" <dt...@cloudera.com> wrote:

    Hi Antoni,
    
    First of all, let me say that indeed that statement is kind of confusing
    and in some cases wrong.
    
    The best way to think of the 2GB Java limit is with respect to the amount
    of serialized metadata that the catalog server needs to send at any point
    in time. There are operations that require a single table to be serialized,
    so in that case the 2GB limit is "applied" at the table level. However,
    there are other operations, such as sending a catalog topic update to the
    statestore. In that case, the catalog may serialize multiple tables
    (depending on whether there was a change in metadata since the last update)
    into a single Thrift message. Hence, in this case the limit is applied to
    the cumulative size of serialized table metadata. That said, in Impala
    v2.12 we're changing the behavior so that the 2GB limit is always applied
    at a single table.
    
    Hope that helps.
    Dimitris
    
    On Thu, Mar 22, 2018 at 12:04 PM, Antoni Ivanov <an...@gmail.com> wrote:
    
    > Hi,
    >
    > I was reading https://urldefense.proofpoint.com/v2/url?u=https-3A__www.cloudera.com_documentation_enterprise_5-2D11-2D&d=DwIFaQ&c=uilaK90D4TOVoH58JNXRgQ&r=8j7tFznmoySfY9RiEAZHgvi3kzcbJ_Zy6Hp9HYX4dDE&m=jtsSPRqJ13BcXd8BqA3RMj9x7Hni24nOtcz4yVMgG30&s=32fFHQs-e5H2KjE_P6RVlmapbPg9-Zk8OaI-m8stEv0&e=
    > x/topics/impala_partitioning.html#partition_stats
    > And noticed this warning "If this metadata for all tables combined exceeds
    > 2 GB, you might experience service downtime.”
    >
    > But I thought that this limit applies for single table. Having large
    > number of partitions per table can be an issue because each metadata
    > operation for this table would require the whole metadata be updated (as
    > far as I understand Impala doesn’t update partially the metadata for only
    > the changed partitions) Also because of Java serialization limitation where
    > you cannot serialize it to more than 1G or 2G
    > I am guessing this is related to https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_&d=DwIFaQ&c=uilaK90D4TOVoH58JNXRgQ&r=8j7tFznmoySfY9RiEAZHgvi3kzcbJ_Zy6Hp9HYX4dDE&m=jtsSPRqJ13BcXd8BqA3RMj9x7Hni24nOtcz4yVMgG30&s=MNcc4Uarowr_mGlWOJbZqfkM9yPZhe6NyxdZ6VUTwdI&e=
    > jira/browse/IMPALA-5058 - but that’s only if you do frequent DDL
    > operations I suppose.
    >
    > Am I understanding things so far correctly. So why can there be service
    > downtime in this case ?
    >
    > Thanks,
    > Antoni

Re: Large number of partitions warning

Posted by Dimitris Tsirogiannis <dt...@cloudera.com>.

Hi Antoni,

First of all, let me say that indeed that statement is kind of confusing
and in some cases wrong.

The best way to think of the 2GB Java limit is with respect to the amount
of serialized metadata that the catalog server needs to send at any point
in time. There are operations that require a single table to be serialized,
so in that case the 2GB limit is "applied" at the table level. However,
there are other operations, such as sending a catalog topic update to the
statestore. In that case, the catalog may serialize multiple tables
(depending on whether there was a change in metadata since the last update)
into a single Thrift message. Hence, in this case the limit is applied to
the cumulative size of serialized table metadata. That said, in Impala
v2.12 we're changing the behavior so that the 2GB limit is always applied
at a single table.

Hope that helps.
Dimitris

On Thu, Mar 22, 2018 at 12:04 PM, Antoni Ivanov <an...@gmail.com> wrote:

> Hi,
>
> I was reading https://www.cloudera.com/documentation/enterprise/5-11-
> x/topics/impala_partitioning.html#partition_stats
> And noticed this warning "If this metadata for all tables combined exceeds
> 2 GB, you might experience service downtime.”
>
> But I thought that this limit applies for single table. Having large
> number of partitions per table can be an issue because each metadata
> operation for this table would require the whole metadata be updated (as
> far as I understand Impala doesn’t update partially the metadata for only
> the changed partitions) Also because of Java serialization limitation where
> you cannot serialize it to more than 1G or 2G
> I am guessing this is related to https://issues.apache.org/
> jira/browse/IMPALA-5058 - but that’s only if you do frequent DDL
> operations I suppose.
>
> Am I understanding things so far correctly. So why can there be service
> downtime in this case ?
>
> Thanks,
> Antoni