Posted to user@hive.apache.org by mahender bigdata <Ma...@outlook.com> on 2016/02/24 01:37:30 UTC
Anyway to avoid creating subdirectories by "Insert with union"
Hi
The insert with union below creates sub-directories when executed on Tez.
set hive.execution.engine=tez;
insert overwrite table t3
select * from t1 limit 1
union
select * from t2 limit 2 ;
Is there any way to avoid creating sub-directories while running on Tez, or is this by design and cannot be changed?
An alternate workaround is:
insert overwrite table t3
select * from
( select * from t1 limit 1
union
select * from t2 limit 2) sub;
But the above query involves reading the same results twice. Is there any setting that disables directory creation while running on Tez?
Thanks in advance.
Re: Anyway to avoid creating subdirectories by "Insert with union"
Posted by mahender bigdata <Ma...@outlook.com>.
Thanks Gopal, will look into it.
On 2/24/2016 4:26 PM, Gopal Vijayaraghavan wrote:
>> SET mapred.input.dir.recursive=TRUE;
> ...
>> Can we set above setting as tblProperties or Hive Table properties.
> Not directly, those are MapReduce properties - they are not settable via
> Hive tables.
>
> That said, you can write your own SemanticAnalyzerHooks to do pretty much
> anything you want like that.
>
> You can use hooks to modify the job, after tables have been resolved.
>
>
> Ideally such a hook should not modify the plan (much), because it's too
> late to do it right.
>
> But I sometimes prototype Hive optimizer features as Hooks, like this one.
>
> https://github.com/t3rmin4t0r/captain-hook
>
>
> Cheers,
> Gopal
>
>
Re: Anyway to avoid creating subdirectories by "Insert with union"
Posted by Mahender Sarangam <Ma...@outlook.com>.
Hi Gopal,
Another question I have: whenever we run a UNION ALL statement,
apart from folders we also see zero-byte files in HDFS. Are these lock
files (LCK)?
Mahender
On 2/24/2016 4:26 PM, Gopal Vijayaraghavan wrote:
>> SET mapred.input.dir.recursive=TRUE;
> ...
>> Can we set above setting as tblProperties or Hive Table properties.
> Not directly, those are MapReduce properties - they are not settable via
> Hive tables.
>
> That said, you can write your own SemanticAnalyzerHooks to do pretty much
> anything you want like that.
>
> You can use hooks to modify the job, after tables have been resolved.
>
>
> Ideally such a hook should not modify the plan (much), because it's too
> late to do it right.
>
> But I sometimes prototype Hive optimizer features as Hooks, like this one.
>
> https://github.com/t3rmin4t0r/captain-hook
>
>
> Cheers,
> Gopal
>
>
Re: Anyway to avoid creating subdirectories by "Insert with union"
Posted by Gopal Vijayaraghavan <go...@apache.org>.
> SET mapred.input.dir.recursive=TRUE;
...
> Can we set above setting as tblProperties or Hive Table properties.
Not directly, those are MapReduce properties - they are not settable via
Hive tables.
That said, you can write your own SemanticAnalyzerHooks to do pretty much
anything you want like that.
You can use hooks to modify the job, after tables have been resolved.
Ideally such a hook should not modify the plan (much), because it's too
late to do it right.
But I sometimes prototype Hive optimizer features as Hooks, like this one.
https://github.com/t3rmin4t0r/captain-hook
Cheers,
Gopal
Re: Anyway to avoid creating subdirectories by "Insert with union"
Posted by mahender bigdata <Ma...@outlook.com>.
Thanks Gopal. This is an architectural change from Hive 0.13 to Hive
1.2. We are migrating our Hive queries from 0.13 to 1.2. Previously they
ran perfectly against 0.13, but the same query in 1.2 fails because the
union/union-all performance improvement creates sub-directories. This
forces us to add:
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
"hive.input.dir.recursive" = "TRUE"
"hive.supports.subdirectories" = "TRUE"
We would need to add these to all our HQLs, which is a tough job, and I
can't set them at cluster level since that would impact Hive queries.
Is there any setting at the table-properties level?
Another question is about UNION ALL on the Tez execution engine:
Select * from (select * from Tbl1 union all select * from Tbl2) as Tbl3
So there will be no repeated reading of the same data? Tez takes care of
pre-aggregation, whereas MR accessed the data twice.
Can we set the above settings as tblProperties or Hive table properties?
/Mahender
On 2/23/2016 11:37 PM, Gopal Vijayaraghavan wrote:
>> Is there anyway to avoid creating sub-directories while running in tez?
>> Or this is by design and can not be changed?
> Yes, this is by design. The Tez execution of UNION is entirely parallel &
> the task-ids overlap - so the files created have to have unique names.
>
> But the total counts for "Map 1" and "Map 2" are only available as the job
> runs, so they write to different dirs.
>
> Here's a comparison of MapReduce vs Tez (from 2014, some slides are out of
> date now).
>
> http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/15
>
>
> This UNION method is faster because of fewer intermediate HDFS writes &
> mapreduce.input.fileinputformat.input.dir.recursive=true kicks in as long
> as your cluster runs YARN (which it does, because otherwise Tez wouldn't
> work).
>
> Cheers,
> Gopal
>
>
Re: Anyway to avoid creating subdirectories by "Insert with union"
Posted by Gopal Vijayaraghavan <go...@apache.org>.
>Is there anyway to avoid creating sub-directories while running in tez?
>Or this is by design and can not be changed?
Yes, this is by design. The Tez execution of UNION is entirely parallel &
the task-ids overlap - so the files created have to have unique names.
But the total counts for "Map 1" and "Map 2" are only available as the job
runs, so they write to different dirs.
Here's a comparison of MapReduce vs Tez (from 2014, some slides are out of
date now).
http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/15
This UNION method is faster because of fewer intermediate HDFS writes &
mapreduce.input.fileinputformat.input.dir.recursive=true kicks in as long
as your cluster runs YARN (which it does, because otherwise Tez wouldn't
work).
Cheers,
Gopal
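For readers who hit the same problem, here is a minimal sketch of the session-level approach discussed in this thread: set the recursive-read properties before querying a table whose directory was written by a Tez UNION. The property names are the ones quoted in the messages above, and t3 is the table from the original question. Whether all three settings are needed, or sufficient, on a particular Hive version is an assumption to verify on your own cluster.

```sql
-- Sketch only: property names are taken from this thread, not verified
-- against any specific Hive release.
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
-- Property name as quoted in Gopal's reply about YARN clusters
SET mapreduce.input.fileinputformat.input.dir.recursive=true;

-- t3 was populated by the INSERT ... UNION on Tez, so its data files
-- sit in sub-directories under the table location.
SELECT COUNT(*) FROM t3;
```

Because these are session settings, they affect only the current session and would need to be repeated in each script that reads such tables, which is the maintenance cost raised earlier in the thread.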