You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2012/07/19 06:44:34 UTC

[jira] [Created] (HIVE-3276) optimize union sub-queries

Namit Jain created HIVE-3276:
--------------------------------

             Summary: optimize union sub-queries
                 Key: HIVE-3276
                 URL: https://issues.apache.org/jira/browse/HIVE-3276
             Project: Hive
          Issue Type: Bug
            Reporter: Namit Jain
            Assignee: Nadeem Moidu



It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.

For eg:

a query like:



insert overwrite table T1 partition P1
select * from 
(
  subq1
    union all
  subq2
) u;


today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
the final one for the union. 

It might be a good idea to optimize this. Instead of creating the union 
task, it might be simpler to create a move task (or something like a move
task), where the outputs of the two sub-queries will be moved to the final 
directory. This can easily extend to more than 2 sub-queries in the union.

This is only useful if there is a select * followed by filesink after the
union. This can be independently useful, and also be used to optimize the
skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475058#comment-13475058 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

@Carl, comments addressed.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436202#comment-13436202 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

@Kevin, this is ready for review.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)

comments addressed
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-3276:
--------------------------------

    Assignee: Namit Jain
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.13.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432909#comment-13432909 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

https://reviews.facebook.net/D4623
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Description: 
It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.

For eg:

a query like:



insert overwrite table T1 partition P1
select * from 
(
  subq1
    union all
  subq2
) u;


today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
the final one for the union. 

It might be a good idea to optimize this. Instead of creating the union 
task, it might be simpler to create a move task (or something like a move
task), where the outputs of the two sub-queries will be moved to the final 
directory. This can easily extend to more than 2 sub-queries in the union.

This is very useful if there is a select * followed by filesink after the
union. This can be independently useful, and also be used to optimize the
skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.

If there is a select, filter between the union and the filesink, the select
and the filter can be moved before the union, and the follow-up job can
still be removed.

  was:
It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.

For eg:

a query like:



insert overwrite table T1 partition P1
select * from 
(
  subq1
    union all
  subq2
) u;


today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
the final one for the union. 

It might be a good idea to optimize this. Instead of creating the union 
task, it might be simpler to create a move task (or something like a move
task), where the outputs of the two sub-queries will be moved to the final 
directory. This can easily extend to more than 2 sub-queries in the union.

This is very useful if there is a select * followed by filesink after the
union. This can be independently useful, and also be used to optimize the
skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.

If there is a select, filter between the union and the filesink, the select and the filter can be moved before the union, and the follow-up job can 
still be removed.

    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.14.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)

ShimLoader changes are copied from HIVE-3029, only to run tests on hadoop 23.
Once HIVE-3029 is checked in, this file will be reverted

Also, to run tests for thew newly added tests only on hadoop 23:

ant clean package


ant test -Dhadoop.mr.rev=23 -Dtest.print.classpath=true -Dhadoop.version=2.0.0-alpha -Dhadoop.security.version=2.0.0-alpha -Dtestcase=TestCliDriver -Dqfile=union_remove_1.q,union_remove_2.q,union_remove_3.q,union_remove_4.q,union_remove_5.q,union_remove_6.q,union_remove_7.q,union_remove_8.q,union_remove_9.q,union_remove_10.q,union_remove_11.q,union_remove_12.q,union_remove_13.q,union_remove_14.q,union_remove_15.q,union_remove_16.q,union_remove_17.q,union_remove_18.q

                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Work started] (HIVE-3276) optimize union sub-queries

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HIVE-3276 started by Nadeem Moidu.

> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Nadeem Moidu
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.5.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3276:
--------------------------------

    Status: Open  (was: Patch Available)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3276:
--------------------------------

    Affects Version/s: 0.10.0
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13422533#comment-13422533 ] 

Nadeem Moidu commented on HIVE-3276:
------------------------------------

A wiki page has been added for the same here https://cwiki.apache.org/confluence/display/Hive/Union+Optimization . I have started work on this and will be uploading the patch soon. Feel free to give any feedback on the same. Thanks.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Nadeem Moidu
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.7.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.12.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13468682#comment-13468682 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

The tests finished fine
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436105#comment-13436105 ] 

Kevin Wilfong commented on HIVE-3276:
-------------------------------------

Namit, is this ready for review?  You mention that more test need to be added, but the JIRA is marked Patch Available.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)

addressed comments
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.9.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.6.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (HIVE-3276) optimize union sub-queries

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nadeem Moidu reassigned HIVE-3276:
----------------------------------

    Assignee:     (was: Nadeem Moidu)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Open  (was: Patch Available)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13436176#comment-13436176 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

I was able to run tests for hadoop 23.
I will upload the new patch soon.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.11.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: In Progress)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467221#comment-13467221 ] 

Kevin Wilfong commented on HIVE-3276:
-------------------------------------

I posted a couple of minor comments and a bunch of questions in Phabricator.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Open  (was: Patch Available)

will address comments on phabricator
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Open  (was: Patch Available)

addressing comments on phabricator
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.2.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3276:
--------------------------------

    Status: Open  (was: Patch Available)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3276:
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.10.0
           Status: Resolved  (was: Patch Available)

Committed, thanks Namit.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)

addressed comments - added new tests for double/bigint conversion
refreshed patch + test outputs
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487852#comment-13487852 ] 

Hudson commented on HIVE-3276:
------------------------------

Integrated in Hive-trunk-h0.21 #1767 (See [https://builds.apache.org/job/Hive-trunk-h0.21/1767/])
    HIVE-3276. optimize union sub-queries. (njain via kevinwilfong) (Revision 1403928)

     Result = FAILURE
kevinwilfong : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1403928
Files : 
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/ErrorMsg.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnInfo.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FilterOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/SelectOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRFileSink1.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRProcContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/unionproc/UnionProcContext.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/unionproc/UnionProcFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/unionproc/UnionProcessor.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/FileSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/SelectDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoin_union_remove_1.q
* /hive/trunk/ql/src/test/queries/clientpositive/skewjoin_union_remove_2.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_1.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_10.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_11.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_12.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_13.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_14.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_15.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_16.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_17.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_18.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_19.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_2.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_20.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_21.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_22.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_23.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_24.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_3.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_4.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_5.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_6.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_7.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_8.q
* /hive/trunk/ql/src/test/queries/clientpositive/union_remove_9.q
* /hive/trunk/ql/src/test/results/clientpositive/skewjoin_union_remove_1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/skewjoin_union_remove_2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_15.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_16.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_17.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_18.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_19.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_20.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_21.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_22.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_23.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_24.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/union_remove_9.q.out

                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.10.0
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>             Fix For: 0.10.0
>
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Patch Available  (was: Open)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13445843#comment-13445843 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

refreshed
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.8.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478349#comment-13478349 ] 

Kevin Wilfong commented on HIVE-3276:
-------------------------------------

I was referring to  HIVE-3544 in the above comment.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.10.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nadeem Moidu updated HIVE-3276:
-------------------------------

    Attachment: HIVE-3276.1.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Nadeem Moidu
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Kevin Wilfong (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13486554#comment-13486554 ] 

Kevin Wilfong commented on HIVE-3276:
-------------------------------------

+1
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, hive.3276.12.patch, hive.3276.13.patch, hive.3276.14.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460188#comment-13460188 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

All the tests ran fine
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13473935#comment-13473935 ] 

Carl Steinbach commented on HIVE-3276:
--------------------------------------

@Namit: I added two comments on phabricator. I'm looking at this pretty late so feel free to ignore them.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: hive.3276.10.patch, hive.3276.11.patch, HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch, hive.3276.5.patch, hive.3276.6.patch, hive.3276.7.patch, hive.3276.8.patch, hive.3276.9.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select
> and the filter can be moved before the union, and the follow-up job can
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Description: 
It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.

For eg:

a query like:



insert overwrite table T1 partition P1
select * from 
(
  subq1
    union all
  subq2
) u;


today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
the final one for the union. 

It might be a good idea to optimize this. Instead of creating the union 
task, it might be simpler to create a move task (or something like a move
task), where the outputs of the two sub-queries will be moved to the final 
directory. This can easily extend to more than 2 sub-queries in the union.

This is very useful if there is a select * followed by filesink after the
union. This can be independently useful, and also be used to optimize the
skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.

If there is a select, filter between the union and the filesink, the select and the filter can be moved before the union, and the follow-up job can 
still be removed.

  was:

It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.

For eg:

a query like:



insert overwrite table T1 partition P1
select * from 
(
  subq1
    union all
  subq2
) u;


today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
the final one for the union. 

It might be a good idea to optimize this. Instead of creating the union 
task, it might be simpler to create a move task (or something like a move
task), where the outputs of the two sub-queries will be moved to the final 
directory. This can easily extend to more than 2 sub-queries in the union.

This is only useful if there is a select * followed by filesink after the
union. This can be independently useful, and also be used to optimize the
skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is very useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html.
> If there is a select, filter between the union and the filesink, the select and the filter can be moved before the union, and the follow-up job can 
> still be removed.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.3.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Nadeem Moidu (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427510#comment-13427510 ] 

Nadeem Moidu commented on HIVE-3276:
------------------------------------

The patch is partially complete. https://reviews.facebook.net/D4431
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Nadeem Moidu
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428996#comment-13428996 ] 

Namit Jain commented on HIVE-3276:
----------------------------------

Looked at it in more detail.

It might be cleaner to add a optimization step for this, which changes the operator tree.
The current approach is simpler to quickly get the code out, but may not be a good idea in the long run.
We have tried to keep all the optimizations pluggable, it helps with roll-outs, fixing bugs slowly etc.
With the current approach, it is very difficult to make this pluggable. Again, it it possible to check
the new conf. in GenMRUnion1, but it looks like a hacky approach.
                
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Status: Open  (was: Patch Available)
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-3276) optimize union sub-queries

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-3276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3276:
-----------------------------

    Attachment: hive.3276.4.patch
    
> optimize union sub-queries
> --------------------------
>
>                 Key: HIVE-3276
>                 URL: https://issues.apache.org/jira/browse/HIVE-3276
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Namit Jain
>            Assignee: Namit Jain
>         Attachments: HIVE-3276.1.patch, hive.3276.2.patch, hive.3276.3.patch, hive.3276.4.patch
>
>
> It might be a good idea to optimize simple union queries containing map-reduce jobs in at least one of the sub-qeuries.
> For eg:
> a query like:
> insert overwrite table T1 partition P1
> select * from 
> (
>   subq1
>     union all
>   subq2
> ) u;
> today creates 3 map-reduce jobs, one for subq1, another for subq2 and 
> the final one for the union. 
> It might be a good idea to optimize this. Instead of creating the union 
> task, it might be simpler to create a move task (or something like a move
> task), where the outputs of the two sub-queries will be moved to the final 
> directory. This can easily extend to more than 2 sub-queries in the union.
> This is only useful if there is a select * followed by filesink after the
> union. This can be independently useful, and also be used to optimize the
> skewed joins https://cwiki.apache.org/Hive/skewed-join-optimization.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira