Posted to user@hive.apache.org by Jim Green <op...@gmail.com> on 2015/08/19 23:40:57 UTC

Tez: Any way to avoid creating subdirectories by "Insert with union all"?

Hi Team,

On Tez, the insert with UNION ALL below creates sub-directories under the target table directory:
set hive.execution.engine=tez;
create table h1_passwords_target like h1_passwords;

 insert overwrite table h1_passwords_target
 select * from
 (select * from h1_passwords limit 1
 union all
 select * from h1_passwords limit 2 ) sub;


[root@h1 h1_passwords_target]# ls -altr
total 2
drwxrwxrwx 115 xxx xxx 113 Aug 19 21:24 ..
drwxr-xr-x   2 xxx xxx   1 Aug 19 21:25 2
drwxr-xr-x   2 xxx xxx   1 Aug 19 21:25 1
drwxr-xr-x   4 xxx xxx   2 Aug 19 21:25 .

Is there any way to avoid creating sub-directories? Or is this by design and
cannot be changed?

This matters because non-Tez queries against this table do not work by
default, since hive.mapred.supports.subdirectories=false.
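
For context, a sketch of the session settings usually used to let a
MapReduce-based query read from sub-directories. Only
hive.mapred.supports.subdirectories is named in this message;
mapred.input.dir.recursive is an added assumption and was not tested against
this table.

```sql
-- Sketch only: allow a non-Tez session to read files in table sub-directories.
set hive.execution.engine=mr;
set hive.mapred.supports.subdirectories=true;
-- Assumption: recursive input listing is also needed for the sub-directories to be scanned.
set mapred.input.dir.recursive=true;

select count(*) from h1_passwords_target;
```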

-- 
Thanks,
www.openkb.info
(Open KnowledgeBase for Hadoop/Database/OS/Network/Tool)

Re: Tez: Any way to avoid creating subdirectories by "Insert with union all"?

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> Is there anyway to avoid creating sub-directories? Or this is by design
>and can not be changed?

This comes from the way file formats generate Hadoop file names without
collisions.

For instance, any change to that would break Parquet-MR for Tez. That's
why we generate a compatible, but colliding mapreduce.task.attempt.id
artificially for Tez jobs.

"Map 1" and "Map 2" would both have an attempt 0 of task 1, generating
colliding file names (0001_0).

The easy workaround is a "re-load" of the table.

insert overwrite table h1_passwords_target select * from
h1_passwords_target;


The slightly more complex one is to add a DISTRIBUTE BY & trigger a
reducer after the UNION ALL.
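
A minimal sketch of that second workaround, applied to the query from the
original post. The column name `username` is hypothetical (the schema of
h1_passwords is not shown in this thread); any column should do, since the
only purpose is to force a reduce stage that writes the final files.

```sql
set hive.execution.engine=tez;

-- `username` is a hypothetical column; substitute a real column of h1_passwords.
-- DISTRIBUTE BY adds a reducer after the UNION ALL, so the output files are
-- written by a single vertex instead of by each union branch.
insert overwrite table h1_passwords_target
select * from
(select * from h1_passwords limit 1
 union all
 select * from h1_passwords limit 2) sub
distribute by username;
```

Listing the table directory afterwards is a quick way to check whether the
files now sit directly under h1_passwords_target.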

Cheers,
Gopal
