Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2011/06/14 04:53:47 UTC

[jira] [Created] (HIVE-2218) speedup addInputPaths

speedup addInputPaths
---------------------

                 Key: HIVE-2218
                 URL: https://issues.apache.org/jira/browse/HIVE-2218
             Project: Hive
          Issue Type: Improvement
            Reporter: He Yongqiang
            Assignee: He Yongqiang


Speed up addInputPaths for the combined symlink input format, and add some other micro-optimizations that also help in normal cases.

This can help reduce the start time of one query from 5 hours to less than 20 minutes.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050144#comment-13050144 ] 

Ning Zhang commented on HIVE-2218:
----------------------------------

+1


[jira] [Updated] (HIVE-2218) speedup addInputPaths

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2218:
-------------------------------

    Status: Patch Available  (was: Open)


[jira] [Updated] (HIVE-2218) speedup addInputPaths

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2218:
-------------------------------

    Attachment: HIVE-2218.2.patch


[jira] [Updated] (HIVE-2218) speedup addInputPaths

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2218:
-------------------------------

    Attachment: HIVE-2218.3.patch

Use path.getParent() instead of hand-written code to get the parent of a path.


[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050030#comment-13050030 ] 

jiraposter@reviews.apache.org commented on HIVE-2218:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/898/#review846
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1839>

    If we change accept(), we need to remove + File.separator here.



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1837>

    I think it would be better to use path.getParent().toString() instead of doing string manipulation. 
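
A minimal sketch of the difference, not taken from the patch: java.nio.file.Path stands in for org.apache.hadoop.fs.Path so the example runs without Hadoop on the classpath, and the class and method names are hypothetical. Both Path types expose getParent(); the hand-written version below is only an assumed shape of the string manipulation being discussed.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration; java.nio.file.Path is a stand-in for Hadoop's Path.
class ParentDir {

  // Hand-written version: cut the string at the last '/'.
  // A trailing slash makes it return the directory itself, not its parent.
  static String parentByString(String path) {
    int idx = path.lastIndexOf('/');
    if (idx < 0) {
      return null; // no separator, so no parent can be derived
    }
    return idx == 0 ? "/" : path.substring(0, idx);
  }

  // Library version: the path is normalized before the parent is taken.
  static String parentByApi(String path) {
    Path parent = Paths.get(path).getParent();
    return parent == null ? null : parent.toString();
  }

  public static void main(String[] args) {
    String file = "/warehouse/t/ds=1/000000_0";
    String dirWithSlash = "/warehouse/t/ds=1/";
    System.out.println(parentByString(file) + " vs " + parentByApi(file));
    System.out.println(parentByString(dirWithSlash) + " vs " + parentByApi(dirWithSlash));
  }
}
```

The two versions agree on the plain file path and disagree on the path with a trailing slash, which is the kind of edge case the library call handles for free.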


- Ning


On 2011-06-15 20:24:15, Yongqiang He wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/898/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-06-15 20:24:15)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  speedup addInputPaths
bq.  
bq.  
bq.  This addresses bug HIVE-2218.
bq.      https://issues.apache.org/jira/browse/HIVE-2218
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 
bq.  
bq.  Diff: https://reviews.apache.org/r/898/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  yes.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Yongqiang
bq.  
bq.




[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049963#comment-13049963 ] 

jiraposter@reviews.apache.org commented on HIVE-2218:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/898/#review842
-----------------------------------------------------------



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1826>

    This change will make the order of paths in pathProcessed non-deterministic. This means mapred.input.dir will not have the same order as before. Not sure if it is safe or not, but if you replace HashSet with LinkedHashSet, the order will be preserved.
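
To illustrate the ordering point, here is a small standalone sketch. It is not the ExecDriver code: the class name, the helper, and the sample paths are made up for the example. It de-duplicates a list of input paths and joins them the way a comma-separated mapred.input.dir value would be built.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not Hive code.
class InputPathOrder {

  // Adds all paths to the given set and joins the distinct entries
  // with commas, in whatever order the set iterates.
  static String joinDistinct(List<String> paths, Set<String> seen) {
    seen.addAll(paths);
    return String.join(",", seen);
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList(
        "/warehouse/t/ds=3", "/warehouse/t/ds=1", "/warehouse/t/ds=3", "/warehouse/t/ds=2");

    // LinkedHashSet drops the duplicate and keeps first-seen order:
    // prints /warehouse/t/ds=3,/warehouse/t/ds=1,/warehouse/t/ds=2
    System.out.println(joinDistinct(paths, new LinkedHashSet<>()));

    // HashSet also drops the duplicate, but its iteration order is unspecified.
    System.out.println(joinDistinct(paths, new HashSet<>()));
  }
}
```

Both sets give constant-time membership checks, so switching to LinkedHashSet keeps the speedup while making the resulting path list reproducible.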



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1828>

    This needs a comment explaining why this case doesn't need to check for empty paths.
    
    In terms of efficiency, it seems to me that checking for empty paths is not the most expensive part (the number of RPCs is large, but each listStatus() should be fast). Also, we should be able to cache the results of listStatus for each path (this needs an extension to Utilities.isEmpty), since they are needed anyway in other operations (computing splits, etc.). If 
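
The caching idea could look roughly like the sketch below. It is an assumption-laden illustration, not the actual Utilities.isEmpty: the class name is hypothetical, and a plain Function that returns a non-null array of child names stands in for a listStatus()-style remote call.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of memoizing directory listings; not Hive code.
class EmptyPathCache {

  // Stand-in for a listStatus()-like call; assumed to return a non-null array.
  private final Function<String, String[]> lister;
  private final Map<String, String[]> listings = new HashMap<>();

  // Counts how often the (simulated) remote call is actually made.
  int remoteCalls = 0;

  EmptyPathCache(Function<String, String[]> lister) {
    this.lister = lister;
  }

  // Returns the listing for dir, calling the lister only on the first request.
  String[] list(String dir) {
    String[] cached = listings.get(dir);
    if (cached == null) {
      remoteCalls++;
      cached = lister.apply(dir);
      listings.put(dir, cached);
    }
    return cached;
  }

  // The emptiness check reuses the cached listing.
  boolean isEmpty(String dir) {
    return list(dir).length == 0;
  }
}
```

With this shape, an emptiness check followed by a later split computation on the same directory costs one listing call instead of two.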



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
<https://reviews.apache.org/r/898/#comment1829>

    It seems that we are duplicating the work of FileInputFormat.setInputPaths(JobConf, String commaSeparatedPaths). I think it would be safer and cleaner to first get an array of paths and call:
    
    FileInputFormat.setInputPaths(jobConf, StringUtils.stringToPath(paths))  // paths is a String[]



trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
<https://reviews.apache.org/r/898/#comment1830>

    indentation



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1833>

    why do we need it here?



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1834>

    ! mrwork.getPartDescToRework().isEmpty()



trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java
<https://reviews.apache.org/r/898/#comment1835>

    The logic here is too complex, and I think it should be refactored. Is the following what you wanted?
    
    if (all_partitions_are_rework()) {
      prepareNullCombineFilter(combine);
    } else {
      prepareNormalCombineFilter(combine);
    } 
    InputSplitShim[] iss = combine.getSplits()



trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java
<https://reviews.apache.org/r/898/#comment1832>

    Does this just need a HashSet<PartitionDesc> rather than a HashMap<PartitionDesc, Boolean>?
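
A small sketch of the two shapes, assuming the map only ever stores true for partitions that need rework (that assumption is what would make the Boolean redundant). String keys stand in for PartitionDesc, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; String keys stand in for PartitionDesc.
class ReworkMarkers {

  // Map form: the value must be unwrapped and null-checked on every lookup.
  static boolean needsReworkMap(Map<String, Boolean> marked, String part) {
    return Boolean.TRUE.equals(marked.get(part));
  }

  // Set form: membership alone carries the same information.
  static boolean needsReworkSet(Set<String> marked, String part) {
    return marked.contains(part);
  }

  public static void main(String[] args) {
    Map<String, Boolean> asMap = new HashMap<>();
    asMap.put("ds=1", Boolean.TRUE);

    Set<String> asSet = new HashSet<>();
    asSet.add("ds=1");

    System.out.println(needsReworkMap(asMap, "ds=1") == needsReworkSet(asSet, "ds=1"));
    System.out.println(needsReworkMap(asMap, "ds=2") == needsReworkSet(asSet, "ds=2"));
  }
}
```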


- Ning


On 2011-06-14 21:09:13, Yongqiang He wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/898/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-06-14 21:09:13)
bq.  
bq.  
bq.  Review request for hive.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  speedup addInputPaths
bq.  
bq.  
bq.  This addresses bug HIVE-2218.
bq.      https://issues.apache.org/jira/browse/HIVE-2218
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 
bq.    trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335 
bq.  
bq.  Diff: https://reviews.apache.org/r/898/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  yes.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Yongqiang
bq.  
bq.




[jira] [Updated] (HIVE-2218) speedup addInputPaths

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-2218:
-------------------------------

    Attachment: HIVE-2218.1.patch


[jira] [Updated] (HIVE-2218) speedup addInputPaths

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-2218:
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

Committed. Thanks Yongqiang!


[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054060#comment-13054060 ] 

Hudson commented on HIVE-2218:
------------------------------

Integrated in Hive-trunk-h0.21 #790 (See [https://builds.apache.org/job/Hive-trunk-h0.21/790/])
    


[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049423#comment-13049423 ] 

He Yongqiang commented on HIVE-2218:
------------------------------------

https://reviews.apache.org/r/898/


[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050003#comment-13050003 ] 

jiraposter@reviews.apache.org commented on HIVE-2218:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/898/
-----------------------------------------------------------

(Updated 2011-06-15 20:24:15.631412)


Review request for hive.


Changes
-------

Addressed Ning's comments. Made the minimum change, and the performance is acceptable. We can try removing the empty-path check in the future if the latency turns out to be a problem.


Summary
-------

speedup addInputPaths


This addresses bug HIVE-2218.
    https://issues.apache.org/jira/browse/HIVE-2218


Diffs (updated)
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 

Diff: https://reviews.apache.org/r/898/diff


Testing
-------

yes.


Thanks,

Yongqiang




[jira] [Commented] (HIVE-2218) speedup addInputPaths

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049426#comment-13049426 ] 

jiraposter@reviews.apache.org commented on HIVE-2218:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/898/
-----------------------------------------------------------

Review request for hive.


Summary
-------

speedup addInputPaths


This addresses bug HIVE-2218.
    https://issues.apache.org/jira/browse/HIVE-2218


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1135335 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1135335 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 1135335 
  trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 1135335 

Diff: https://reviews.apache.org/r/898/diff


Testing
-------

yes.


Thanks,

Yongqiang


