Posted to issues@hive.apache.org by "hefuhua (JIRA)" <ji...@apache.org> on 2017/07/17 07:27:00 UTC

[jira] [Updated] (HIVE-17104) Hive dynamic partition loading is too slow

     [ https://issues.apache.org/jira/browse/HIVE-17104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

hefuhua updated HIVE-17104:
---------------------------
    Status: Patch Available  (was: Open)

  When moving data to the destination directory, the function inheritFromTable sets every path from tablePath down to destf (including destf) to the same permissions as tablePath.
  Hive changes each file's group, ACL, and permissions one by one, which costs too much time: if there are thousands of files, the function is called thousands of times. A more efficient approach is to execute it only once per top-level directory, setting the group, ACL, and permissions to match tablePath in a single pass.
  If the dynamic partition spec has more than one partition column, it is enough to take the first partition column's data directory and change only that directory's group, ACL, and permissions. For example:
  insert overwrite table report.data_security_lab partition(stat_date,log_id)
  select * from table ....;
  The target location is {color:red}/user/hive/warehouse/report.db/data_security_lab/stat_date=xxx{color}/log_id=xxx.
We only need to change the group, ACL, and permissions of the directory /user/hive/warehouse/report.db/data_security_lab/stat_date=xxx.
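To make the idea concrete before the actual patch, here is a minimal, self-contained sketch (plain Java, no Hive dependencies; the class and method names are illustrative only, not from the Hive codebase) showing how many leaf partition specs collapse into a single first-level directory once they are de-duplicated through a Set:

import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class FirstLevelDedup {
    // build ".../<firstKey>=<firstValue>" from an insertion-ordered partition spec
    static String firstLevelDir(String tableLocation, LinkedHashMap<String, String> fullPartSpec) {
        Map.Entry<String, String> first = fullPartSpec.entrySet().iterator().next();
        return tableLocation + "/" + first.getKey() + "=" + first.getValue();
    }

    public static void main(String[] args) {
        String tableLocation = "/user/hive/warehouse/report.db/data_security_lab";
        Set<String> destPaths = new LinkedHashSet<>();
        // stand-in for thousands of leaf partitions like {stat_date=20170630, log_id=...}
        for (String logId : new String[] {"1001121", "1012101", "1008105"}) {
            LinkedHashMap<String, String> spec = new LinkedHashMap<>();
            spec.put("stat_date", "20170630");
            spec.put("log_id", logId);
            destPaths.add(firstLevelDir(tableLocation, spec)); // the Set de-duplicates
        }
        // one directory to chmod/chgrp/setAcl instead of one call per leaf partition
        System.out.println(destPaths);
        // prints: [/user/hive/warehouse/report.db/data_security_lab/stat_date=20170630]
    }
}

The patch below applies the same de-duplication during the dynamic-partition loading loop, using Hive's own Path and Warehouse.makePartPath helpers.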
The proposed code change:
Set<Path> destPaths = new HashSet<>();
while (iter.hasNext()) {
        // get the dynamically created directory
        Path partPath = iter.next();
        assert fs.getFileStatus(partPath).isDir():
          "partitions " + partPath + " is not a directory !";
        // generate a full partition specification
        LinkedHashMap<String, String> fullPartSpec = new LinkedHashMap<String, String>(partSpec);
        Warehouse.makeSpecFromName(fullPartSpec, partPath);

        // Patch start: instead of letting every loadPartition() call walk the
        // tree and set permissions file by file, collect the first-level
        // partition directory (e.g. .../stat_date=xxx) so that
        // inheritFromTable can run once per top-level directory after the loop.
        if (!fullPartSpec.isEmpty()) {
          Path tblDataLocationPath = tbl.getDataLocation();
          // the first entry of the (ordered) spec is the top-level partition column
          Map.Entry<String, String> first = fullPartSpec.entrySet().iterator().next();
          Map<String, String> firstPartition = new HashMap<String, String>();
          firstPartition.put(first.getKey(), first.getValue());
          Path partDataPath = new Path(tblDataLocationPath, Warehouse.makePartPath(firstPartition));
          Path newPartPath = new Path(tblDataLocationPath.toUri().getScheme(),
                  tblDataLocationPath.toUri().getAuthority(), partDataPath.toUri().getPath());
          destPaths.add(newPartPath); // HashSet de-duplicates repeated directories
        }
        // Patch end.
        Partition newPartition = loadPartition(partPath, tbl, fullPartSpec, replace,
            holdDDLTime, true, listBucketingEnabled, false, isAcid);
        partitionsMap.put(fullPartSpec, newPartition);
        LOG.info("New loading path = " + partPath + " with partSpec " + fullPartSpec);
      }
      long startTime = System.currentTimeMillis();
      boolean inheritPerms = HiveConf.getBoolVar(conf,
              HiveConf.ConfVars.HIVE_WAREHOUSE_SUBDIR_INHERIT_PERMS);
      if (inheritPerms) {
        // one inheritFromTable call per first-level partition directory,
        // instead of one call per dynamically created partition
        for (Path newPartPath : destPaths) {
          FileSystem destFs = newPartPath.getFileSystem(conf);
          inheritFromTable(tbl.getPath(), newPartPath, conf, destFs);
        }
      }

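As a side note: the per-file permission walk only happens when hive.warehouse.subdir.inherit.perms (HiveConf.ConfVars.HIVE_WAREHOUSE_SUBDIR_INHERIT_PERMS, the same flag checked in the patch above) is enabled. On clusters where permission inheritance is not required, the cost can be avoided entirely by disabling it:

set hive.warehouse.subdir.inherit.perms=false;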
> Hive dynamic partition loading is too slow
> ------------------------------------------
>
>                 Key: HIVE-17104
>                 URL: https://issues.apache.org/jira/browse/HIVE-17104
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 1.2.2, 1.2.1
>         Environment: apache
>            Reporter: hefuhua
>
> Loading dynamic partitions takes too much time when I use Hive dynamic partitioning.
> HQL:
> set fs.defaultFS=hdfs://yq01-ns2;
> use tmp_security_lab;
> add file hdfs://yq01-ns1/user/hive/warehouse-work/script/transform_security_lab.py ;
> set hive.auto.convert.join=false;
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.max.dynamic.partitions=100000;
> SET hive.exec.max.created.files=200000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> set hive.groupby.orderby.position.alias = true;
> set hive.exec.parallel=true;
> set mapreduce.input.fileinputformat.split.maxsize=128000000;
> set mapreduce.input.fileinputformat.split.minsize=128000000;
> set mapred.min.split.size.per.node=128000000;
> set mapred.min.split.size.per.rack=128000000;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set hive.hadoop.supports.splittable.combineinputformat=true;
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=256000000;
> SET mapred.output.compression.type=BLOCK;
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
> set mapreduce.reduce.memory.mb=16384;
> set hive.exec.reducers.max=2000;
> set mapreduce.reduce.java.opts=-Xmx14000m -Xms1800m;
> insert overwrite table report.data_security_lab partition(stat_date,log_id)
> select
>     app_name,
>     pkg,
>     pkg_version,
>     today.cuid,
>     time_s,
>     policy_id,
>     app_key,
>     sdk_name,
>     sdk_version,
>     client_version,
>     lc,
>     host_is_legal,
>     is_wifi,
>     server_time,
>     client_ip,
>     msg_id,
>     d1,
>     d2,
>     build_board,
>     build_device,
>     build_hardware,
>     build_host,
>     build_id,
>     build_product,
>     build_v_codename,
>     build_v_incremental,
>     manufactory,
>     product_module,
>     resolution,
>     rom,
>     uid,
>     imsi,
>     mnc,
>     d3,
>     20170630,
>     today.log_id
> from(
>     select
>         transform(d)
>         USING 'python transform_security_lab.py' as (
>             app_name string,
>             pkg string,
>             pkg_version string,
>             cuid string,
>             log_id string,
>             time_s string,
>             policy_id string,
>             app_key string,
>             sdk_name string,
>             sdk_version string,
>             client_version string,
>             lc string,
>             host_is_legal string,
>             is_wifi string,
>             server_time string,
>             client_ip string,
>             msg_id string,
>             d1 map<string,string>,
>             d2 map<string,string>
>         )
>     from (
>         select d from tmp_security_lab.yq_security_lab
>         where stat_date = 20170630
>         and get_json_object(d,'$.5') != 1001001
>         and get_json_object(d,'$.5') is not null
>         and length(get_json_object(d,'$.5')) in (4,7)
>         and from_unixtime(bigint(get_json_object(d,'$.101')),'yyyyMMdd') = 20170630
>     ) a
> ) today
> left outer join (
>     select
>         build_board,build_device,build_hardware,build_host,build_id,build_product,
>         build_v_codename,build_v_incremental,cuid,manufactory,product_module,
>         resolution,rom,uid,imsi,mnc,d3
>     from report.data_security_lab_hd
>     where stat_date=20170630 and log_id = 1001001
> ) hdinfo
> on today.cuid = hdinfo.cuid
> where length(today.log_id) in (4,7)
> Log:
> 2017-07-14 12:39:05,958 Stage-5 map = 99%,  reduce = 0%, Cumulative CPU 16597.67 sec
> 2017-07-14 12:39:35,829 Stage-5 map = 100%,  reduce = 0%, Cumulative CPU 16619.58 sec
> MapReduce Total cumulative CPU time: 0 days 4 hours 36 minutes 59 seconds 580 msec
> Ended Job = job_1495521521755_1559679
> Moving data to: hdfs://yq01-ns2/user/hive/tmp/work/.hive-staging_hive_2017-07-14_11-31-21_072_907112456105752978-1/-ext-10000/stat_date=20170630/log_id=%2500%2500%2518%2500003
> Moving data to: hdfs://yq01-ns2/user/hive/tmp/work/.hive-staging_hive_2017-07-14_11-31-21_072_907112456105752978-1/-ext-10000/stat_date=20170630/log_id=%2519%2505%2505%2506
> Moving data to: hdfs://yq01-ns2/user/hive/tmp/work/.hive-staging_hive_2017-07-14_11-31-21_072_907112456105752978-1/-ext-10000/stat_date=20170630/log_id=1004102
> Moving data to: hdfs://yq01-ns2/user/hive/tmp/work/.hive-staging_hive_2017-07-14_11-31-21_072_907112456105752978-1/-ext-10000/stat_date=20170630/log_id=1026103
> Moving data to: hdfs://yq01-ns2/user/hive/tmp/work/.hive-staging_hive_2017-07-14_11-31-21_072_907112456105752978-1/-ext-10000/stat_date=20170630/log_id=1026104
> Loading data to table report.data_security_lab partition (stat_date=null, log_id=null)
> 	{color:red}Time taken for load dynamic partitions : 12210247{color}
> 	Loading partition {stat_date=20170630, log_id=1001121}
> 	Loading partition {stat_date=20170630, log_id=1012101}
> 	Loading partition {stat_date=20170630, log_id=1008105}
> 	Loading partition {stat_date=20170630, log_id=1003126}
> 	Loading partition {stat_date=20170630, log_id=1025101}
> 	Loading partition {stat_date=20170630, log_id=1027003}
> 	Loading partition {stat_date=20170630, log_id=1003117}
> 	Loading partition {stat_date=20170630, log_id=2001104}
> 	Loading partition {stat_date=20170630, log_id=1001003}
> Total time taken was about 4 hours, and loading dynamic partitions took more than 3 of them (12210247 ms ≈ 3.4 hours).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)