Posted to issues-all@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2018/10/31 18:01:00 UTC

[jira] [Updated] (IMPALA-3607) Reduce test data loading time from snapshot

     [ https://issues.apache.org/jira/browse/IMPALA-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-3607:
----------------------------------
    Issue Type: Improvement  (was: Bug)

> Reduce test data loading time from snapshot
> -------------------------------------------
>
>                 Key: IMPALA-3607
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3607
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Infrastructure
>    Affects Versions: Impala 2.5.0
>            Reporter: Dimitris Tsirogiannis
>            Priority: Minor
>              Labels: test-infra
>
> Loading test data from snapshot takes a significant amount of time (~20-30min). Given the amount of data loaded (~4GB), the process of loading test data to a local 3-node min-hdfs cluster should be significantly faster. The process currently works as follows:
> 1. Download the latest snapshot.
> 2. Unzip it.
> 3. Use the hdfs dfs -put command to copy the data from the local file system to HDFS.
> We believe the bulk of the time goes to step #3 and is attributed to namenode overhead. Below are a few ideas we can try to improve this:
> 1. Use a backup-and-restore approach for HDFS metadata and data that doesn't go through the namenode. For example, once data is loaded into an HDFS cluster using the old approach, create two snapshots, one for metadata and one for data. Loading the test data is then just a matter of unzipping the snapshots into the appropriate directories. A similar approach is used to back up and restore HDFS clusters (http://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_hdfs_metadata_backup.html). A Jenkins job would still be responsible for checking for changes in test data, doing the slow data loading, and creating the new snapshots.
> 2. Other ideas include the use of EC2 AMIs, Docker, and/or HDFS checkpointing.
> 3. Use faster compression/decompression tools.
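
The first idea above (pack the on-disk directories once, then restore by unpacking) could be sketched roughly as follows. This is only an illustration using Python's standard tarfile module, not the project's actual tooling. The function names and the directory layout in the demo (a "name" directory for metadata and a "data" directory for blocks) are hypothetical, and the sketch assumes the cluster is stopped while its directories are archived or restored.

```python
import tarfile
import tempfile
import time
from pathlib import Path


def create_snapshot(src_dir: Path, archive: Path) -> None:
    """Pack src_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        # arcname="." stores paths relative to src_dir, so the archive
        # can later be unpacked into any target directory.
        tar.add(src_dir, arcname=".")


def restore_snapshot(archive: Path, dest_dir: Path) -> float:
    """Unpack archive into dest_dir and return the elapsed seconds."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    start = time.perf_counter()
    with tarfile.open(archive, "r:gz") as tar:
        # Only unpack archives you created yourself; extractall trusts
        # the member paths stored in the archive.
        tar.extractall(dest_dir)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical stand-ins for the metadata and block directories.
        for name in ("name", "data"):
            src = root / "cluster" / name
            src.mkdir(parents=True)
            (src / "example.bin").write_bytes(b"\0" * 1024)
            create_snapshot(src, root / f"{name}.tar.gz")
        for name in ("name", "data"):
            secs = restore_snapshot(root / f"{name}.tar.gz",
                                    root / "restored" / name)
            print(f"restored {name} snapshot in {secs:.3f}s")
```

Timing the restore separately for each archive makes it easy to compare against the duration of the existing hdfs dfs -put step, and swapping "w:gz"/"r:gz" for another mode supported by tarfile is one way to experiment with the third idea.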


