Posted to user@hive.apache.org by mahender bigdata <Ma...@outlook.com> on 2016/02/24 01:37:30 UTC

Anyway to avoid creating subdirectories by "Insert with union"

Hi,
The insert with union below creates sub-directories when executed on Tez.

  set hive.execution.engine=tez;

   insert overwrite table t3
    select * from t1 limit 1
     union
    select * from t2 limit 2 ;

Is there any way to avoid creating sub-directories while running on Tez? Or is this by design and cannot be changed?

An alternative workaround is:
  
  insert overwrite table t3
  select * from
  (  select * from t1 limit 1
     union
    select * from t2 limit 2) sub;


But the above query involves reading the same results twice. Is there any setting that disables directory creation while running on Tez?

Thanks in advance.

Re: Anyway to avoid creating subdirectories by "Insert with union"

Posted by mahender bigdata <Ma...@outlook.com>.

Thanks Gopal, will look into it.

On 2/24/2016 4:26 PM, Gopal Vijayaraghavan wrote:
>> SET mapred.input.dir.recursive=TRUE;
> ...
>> Can we set above setting as tblProperties or Hive Table properties.
> Not directly, those are MapReduce properties - they are not settable via
> Hive tables.
>
> That said, you can write your own SemanticAnalyzerHooks to do pretty much
> anything you want like that.
>
> You can use hooks to modify the job, after tables have been resolved.
>
>
> Ideally such a hook should not modify the plan (much), because it's too
> late to do it right.
>
> But I sometimes prototype Hive optimizer features as Hooks, like this one.
>
> https://github.com/t3rmin4t0r/captain-hook
>
>
> Cheers,
> Gopal
>
>

Re: Anyway to avoid creating subdirectories by "Insert with union"

Posted by Mahender Sarangam <Ma...@outlook.com>.

Hi Gopal,

Another question I have: whenever we run a UNION ALL statement, apart 
from the folders we also see zero-byte files in HDFS. Are these lock 
files (LCK)?

Mahender


Re: Anyway to avoid creating subdirectories by "Insert with union"

Posted by Gopal Vijayaraghavan <go...@apache.org>.

> SET mapred.input.dir.recursive=TRUE;
...
> Can we set above setting as tblProperties or Hive Table properties.

Not directly, those are MapReduce properties - they are not settable via
Hive tables.

That said, you can write your own SemanticAnalyzerHooks to do pretty much
anything you want like that.

You can use hooks to modify the job, after tables have been resolved.


Ideally such a hook should not modify the plan (much), because it's too
late to do it right.

But I sometimes prototype Hive optimizer features as Hooks, like this one.

https://github.com/t3rmin4t0r/captain-hook


Cheers,
Gopal
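
As a rough illustration of the hook approach mentioned above, a semantic analyzer hook is registered through the hive.semantic.analyzer.hook property, which takes the fully qualified name of the hook class. The class name below is hypothetical and does not come from this thread; the sketch also assumes the compiled class is already on Hive's classpath and that the property is allowed to change at session level on your cluster.

```sql
-- Hypothetical class name, for illustration only.
-- Assumes the hook class is already available on Hive's classpath.
SET hive.semantic.analyzer.hook=com.example.hooks.RecursiveInputHook;

-- Queries compiled after this point in the session pass through the hook.
SELECT COUNT(*) FROM t3;
```

Setting the property per session keeps the experiment away from other users of the cluster, which fits the prototyping use described in the message.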

Re: Anyway to avoid creating subdirectories by "Insert with union"

Posted by mahender bigdata <Ma...@outlook.com>.

Thanks Gopal. This is an architectural change from Hive 0.13 to Hive 
1.2. We are migrating our Hive queries from 0.13 to 1.2. Previously they 
ran perfectly against 0.13, but the same query fails in 1.2 because the 
union/union-all performance improvement creates sub-directories. This 
forces us to add

  SET hive.mapred.supports.subdirectories=TRUE;
  SET mapred.input.dir.recursive=TRUE;
  "hive.input.dir.recursive" = "TRUE"
  "hive.supports.subdirectories" = "TRUE"

We would need to add these to all our HQLs, which is a tough job, and I 
can't set them at the cluster level since that would impact other Hive 
queries. Is there any setting at the table-properties level?

Another question is about UNION ALL on the Tez execution engine:

  select * from (select * from Tbl1 union all select * from Tbl2) as Tbl3

So there will be no repeated read of the same data? Tez takes care of 
pre-aggregation, whereas MR accessed the data twice.

Can we set the above settings as tblProperties or Hive table properties?

/Mahender

On 2/23/2016 11:37 PM, Gopal Vijayaraghavan wrote:
>> Is there anyway to avoid creating sub-directories while running in tez?
>> Or this is by design and can not be changed?
> Yes, this is by design. The Tez execution of UNION is entirely parallel &
> the task IDs overlap - so the files created have to have unique names.
>
> But the total counts for "Map 1" and "Map 2" are only available as the job
> runs, so they write to different dirs.
>
> Here's a comparison of MapReduce vs Tez (from 2014, some slides are out of
> date now).
>
> http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/15
>
>
> This UNION method is faster because of fewer intermediate HDFS writes &
> mapreduce.input.fileinputformat.input.dir.recursive=true kicks in as long
> as your cluster runs YARN (which it does, because otherwise Tez wouldn't
> work).
>
> Cheers,
> Gopal
>
>

Re: Anyway to avoid creating subdirectories by "Insert with union"

Posted by Gopal Vijayaraghavan <go...@apache.org>.

>Is there anyway to avoid creating sub-directories while running in tez?
>Or this is by design and can not be changed?

Yes, this is by design. The Tez execution of UNION is entirely parallel &
the task IDs overlap - so the files created have to have unique names.

But the total counts for "Map 1" and "Map 2" are only available as the job
runs, so they write to different dirs.

Here's a comparison of MapReduce vs Tez (from 2014, some slides are out of
date now).

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/15


This UNION method is faster because of fewer intermediate HDFS writes &
mapreduce.input.fileinputformat.input.dir.recursive=true kicks in as long
as your cluster runs YARN (which it does, because otherwise Tez wouldn't
work).

Cheers,
Gopal
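
For readers who hit read failures on a table populated this way, here is a minimal session-level sketch that uses only the settings named in this thread. It assumes t3 is the target table from the original question. mapred.input.dir.recursive is the older, deprecated name of mapreduce.input.fileinputformat.input.dir.recursive. Whether these settings are needed at all depends on the Hive and Hadoop versions in use.

```sql
-- Session-level settings quoted in this thread; they affect only the
-- current session, not the whole cluster.
SET hive.mapred.supports.subdirectories=true;
SET mapreduce.input.fileinputformat.input.dir.recursive=true;

-- t3 is the table written by the INSERT ... UNION in the original question;
-- its data files sit in one subdirectory per UNION branch.
SELECT COUNT(*) FROM t3;
```

Keeping the settings in the session, or in a shared init script sourced by each HQL file, avoids a cluster-wide change.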