Posted to issues-all@impala.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2018/10/09 20:13:00 UTC

[jira] [Commented] (IMPALA-7627) Parallel the fetching permission process

    [ https://issues.apache.org/jira/browse/IMPALA-7627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16644035#comment-16644035 ] 

Todd Lipcon commented on IMPALA-7627:
-------------------------------------

I also found that this was a very expensive part of metadata loading. In the end, it's only used when performing preflight checks on INSERT queries to ensure that the target partitions are writable, so for a lot of use cases the work is entirely unnecessary.

In IMPALA-7321 I suggested that we might remove this permission fetching entirely. An alternative would be to defer the work until planning an INSERT query, and only check the partitions which might be targets. In the case of a dynamic partitioned insert, we'd still have to check all partitions, which would be expensive, but maybe it's worth it to avoid the cost at table-load time?
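
To make the deferred option concrete, here is a minimal sketch in Python rather than Impala's actual catalog code. The class name `LazyPartitionAccess` and the `probe` callable are hypothetical; `probe` stands in for whatever call asks the filesystem whether a partition directory is writable.

```python
from typing import Callable, Dict, Iterable, List


class LazyPartitionAccess:
    """Caches per-partition writability, computed only on demand.

    `probe` is a hypothetical callable (not an Impala API) that returns
    True when the given partition location is writable.
    """

    def __init__(self, probe: Callable[[str], bool]) -> None:
        self._probe = probe
        self._cache: Dict[str, bool] = {}

    def is_writable(self, location: str) -> bool:
        # Nothing is fetched at table-load time. The first INSERT that
        # targets this partition pays for the lookup; later ones reuse it.
        if location not in self._cache:
            self._cache[location] = self._probe(location)
        return self._cache[location]

    def unwritable_targets(self, targets: Iterable[str]) -> List[str]:
        # For a static-partition INSERT, `targets` is a handful of paths.
        # For a dynamic-partition INSERT it would be every partition.
        return [loc for loc in targets if not self.is_writable(loc)]
```

With this shape, a table that is only ever read never triggers a single permission lookup, while a dynamic partitioned insert still touches every partition, as noted above.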

Another alternative suggested there is to only check the top-level of the table rather than every partition, in the case that the partitions are all in "standard" locations. If someone has managed to chmod one of the partition subdirectories inappropriately, we'll just fail at runtime instead of planning time. So, in the common case where people aren't manually messing with partitions inside their warehouse directory, we can just avoid all this work entirely.
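
A rough sketch of that "standard location" test, again in Python and not taken from Impala. It assumes a standard partition path is the table directory followed by `key=value` components in partition-key order, and it ignores the escaping Hive applies to special characters in partition values. The function names are hypothetical.

```python
from typing import List, Tuple

# A partition is ([(key, value), ...] in partition-key order, location).
Partition = Tuple[List[Tuple[str, str]], str]


def standard_location(table_location: str, spec: List[Tuple[str, str]]) -> str:
    # Assumed layout: <table dir>/<k1>=<v1>/<k2>=<v2>/...
    # Simplification: no escaping of special characters in values.
    parts = [f"{key}={value}" for key, value in spec]
    return "/".join([table_location.rstrip("/")] + parts)


def locations_to_check(table_location: str,
                       partitions: List[Partition]) -> List[str]:
    all_standard = all(
        loc.rstrip("/") == standard_location(table_location, spec)
        for spec, loc in partitions
    )
    if all_standard:
        # Common case: one permission check on the table directory.
        return [table_location]
    # Otherwise fall back to checking every partition as well.
    return [table_location] + [loc for _, loc in partitions]
```

The comparison is a pure string check on metadata already fetched from the metastore, so deciding which case applies costs no extra filesystem calls.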

I'm not against parallelizing as suggested in this JIRA, but it's probably a good time to evaluate the above alternatives.
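
For comparison, the parallel approach proposed in this JIRA could look roughly like the following Python sketch. `fetch_one` is a hypothetical stand-in for the per-partition permission lookup, and the default of 10 workers is an arbitrary choice for illustration, not a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def fetch_permissions(locations: List[str],
                      fetch_one: Callable[[str], T],
                      max_workers: int = 10) -> Dict[str, T]:
    """Runs `fetch_one` for each location on a bounded thread pool.

    `fetch_one` is hypothetical; in practice it would be a blocking
    remote call, which is why threads help despite Python's GIL.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so zip lines up.
        results = list(pool.map(fetch_one, locations))
    return dict(zip(locations, results))
```

This shortens wall-clock time but still issues one request per partition, whereas the alternatives above reduce the number of requests.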

> Parallel the fetching permission process
> ----------------------------------------
>
>                 Key: IMPALA-7627
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7627
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Peikai Zheng
>            Assignee: Peikai Zheng
>            Priority: Major
>
> Catalogd loads the metadata of a table in three phases.
>  First, Catalogd fetches the metadata from the Hive metastore;
>  Then, Catalogd fetches the permissions of each partition from the HDFS NameNode;
>  Finally, Catalogd loads the file descriptors from the HDFS NameNode.
> According to my test results (based on commit *11554a17c75b242767d5a50d66bc2874aa545c77*):
> ||Table (average time, GetFileInfoThread=10)||Phase 1||Phase 2||Phase 3||
> |idm.sauron_message|9.9917115|459.2106944|95.0179163|
> |default.revenue_enriched|12.3377474|111.2969046|40.827472|
> |default.upp_raw_prod|1.5143162|50.0251426|12.6805323|
> |default.hit_to_beacon_playback_prod|1.4294509|49.7670539|18.3557858|
> |default.sitetracking_enriched|13.0003804|112.8746656|42.1824032|
> |default.player_custom_event|9.2618705|493.4865302|116.4986184|
> |default.revenue_day_est|57.9116561|106.5028664|24.005822|
>  Detailed Information of tables:
> ||Table||#Partitions||#Files||Size(without replica) / TB||Size(with replica) / TB||
> |idm.sauron_message|12923|69537|44.4|90.3|
> |default.revenue_enriched|1809|1832001|145.5|308.6|
> |default.upp_raw_prod|801|480000|186.3|424|
> |default.hit_to_beacon_playback_prod|777|793900|46.6|139.9|
> |default.sitetracking_enriched|1809|1842049|21.7|65|
> |default.player_custom_event|8816|2197096|47.2|141.5|
> |default.revenue_day_est|1731|109815|25.9|77.6|
> So, I suggest parallelizing the second phase, which accounts for the majority of the time.
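
As a quick check of the claim in the quoted description, the following Python sketch computes phase 2's share of the total load time for each row of the timing table. The numbers are copied from the table; the time unit is not stated in the source, and a ratio does not depend on it. The names `TIMINGS` and `phase2_share` are made up for this example.

```python
from typing import Dict, Tuple

# (phase 1, phase 2, phase 3) average times, copied from the table above.
TIMINGS: Dict[str, Tuple[float, float, float]] = {
    "idm.sauron_message": (9.9917115, 459.2106944, 95.0179163),
    "default.revenue_enriched": (12.3377474, 111.2969046, 40.827472),
    "default.upp_raw_prod": (1.5143162, 50.0251426, 12.6805323),
    "default.hit_to_beacon_playback_prod": (1.4294509, 49.7670539, 18.3557858),
    "default.sitetracking_enriched": (13.0003804, 112.8746656, 42.1824032),
    "default.player_custom_event": (9.2618705, 493.4865302, 116.4986184),
    "default.revenue_day_est": (57.9116561, 106.5028664, 24.005822),
}


def phase2_share(phases: Tuple[float, float, float]) -> float:
    # Fraction of the total load time spent fetching permissions.
    return phases[1] / sum(phases)


if __name__ == "__main__":
    for table, phases in TIMINGS.items():
        print(f"{table}: {phase2_share(phases):.1%}")
```

Phase 2 takes more than half of the total in every row, with the largest shares on the tables that have the most partitions.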


