Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2011/06/07 02:54:58 UTC
[jira] [Created] (HIVE-2201) remove name node calls in hive by
creating temporary directories
remove name node calls in hive by creating temporary directories
----------------------------------------------------------------
Key: HIVE-2201
URL: https://issues.apache.org/jira/browse/HIVE-2201
Project: Hive
Issue Type: Improvement
Reporter: Namit Jain
Currently, in Hive, when a file gets written by a FileSinkOperator,
the sequence of operations is as follows:
1. In tmp directory tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp1/1
3. Move directory /tmp1 to /tmp2
4. For all files in /tmp2, remove all files starting with _tmp and
duplicate files.
Due to speculative execution, a lot of temporary files are created
in /tmp1 (or /tmp2). This leads to a lot of name node calls,
especially for large queries.
The protocol above can be modified slightly:
1. In tmp directory tmp1, create a tmp file _tmp_1
2. At the end of the operator, move
/tmp1/_tmp_1 to /tmp2/1
3. Move directory /tmp2 to /tmp3
4. For all files in /tmp3, remove all duplicate files.
This should reduce the number of tmp files.
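The two protocols above can be compared with a small local-filesystem
simulation. This is only a sketch, not Hive code: it uses Python's os
module instead of the Hadoop FileSystem API, and the attempt-suffixed
file names, the "published" directory, and the functions run_attempt and
simulate are assumptions made up for illustration. It shows that when
finished files are renamed into a separate directory, the directory that
gets moved and cleaned no longer contains leftover _tmp files from killed
speculative attempts.

```python
import os
import tempfile


def run_attempt(work_dir, final_dir, task_id, attempt, finishes):
    """One (possibly speculative) task attempt writing a temp file.

    The temp file is renamed into final_dir only if the attempt finishes.
    """
    os.makedirs(work_dir, exist_ok=True)
    os.makedirs(final_dir, exist_ok=True)
    tmp_file = os.path.join(work_dir, f"_tmp_{task_id}_{attempt}")
    with open(tmp_file, "w") as f:
        f.write(f"task {task_id} attempt {attempt}\n")
    if finishes:
        os.rename(tmp_file, os.path.join(final_dir, f"{task_id}_{attempt}"))


def simulate(root, separate_dirs):
    """Return the file names the final cleanup pass has to look at."""
    tmp1 = os.path.join(root, "tmp1")
    # Current protocol: finished files stay in tmp1.
    # Modified protocol: finished files are renamed into tmp2.
    done = os.path.join(root, "tmp2") if separate_dirs else tmp1
    # Task 1 has a killed speculative attempt; task 2 has two finished ones.
    attempts = [(1, 0, True), (1, 1, False), (2, 0, True), (2, 1, True)]
    for task_id, attempt, finishes in attempts:
        run_attempt(tmp1, done, task_id, attempt, finishes)
    published = os.path.join(root, "published")
    os.rename(done, published)  # step 3: move the whole directory
    return sorted(os.listdir(published))
```

In this simulation the current protocol leaves one _tmp file in the
moved directory in addition to the duplicate outputs of task 2, while the
modified protocol leaves only the duplicates to be removed.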
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068715#comment-13068715 ]
jiraposter@reviews.apache.org commented on HIVE-2201:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/952/
-----------------------------------------------------------
(Updated 2011-07-20 23:31:54.007436)
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Changes
-------
1. change block merge task too
2. change the capital file name
Summary
-------
reduce name node calls in hive by creating temporary directories
This addresses bug HIVE-2201.
https://issues.apache.org/jira/browse/HIVE-2201
Diffs (updated)
-----
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1148905
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1148905
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1148905
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java 1148905
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/BlockMergeTask.java 1148905
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/RCFileMergeMapper.java 1148905
Diff: https://reviews.apache.org/r/952/diff
Testing
-------
Thanks,
Siying
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: HIVE-2201.4.patch
1. change block merge task too
2. change the capital file name
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-2201:
---------------------------------
Component/s: Query Processor
Fix Version/s: 0.8.0
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: HIVE-2201.2.patch
fix a bug.
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068839#comment-13068839 ]
Hudson commented on HIVE-2201:
------------------------------
Integrated in Hive-trunk-h0.21 #839 (See [https://builds.apache.org/job/Hive-trunk-h0.21/839/])
HIVE-2201:reduce name node calls in hive by creating temporary directories (Siying Dong via He Yongqiang)
heyongqiang : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1149047
Files :
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/RCFileMergeMapper.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/rcfile/merge/BlockMergeTask.java
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054501#comment-13054501 ]
He Yongqiang commented on HIVE-2201:
------------------------------------
can you create a review board?
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Status: Patch Available (was: In Progress)
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054508#comment-13054508 ]
He Yongqiang commented on HIVE-2201:
------------------------------------
1. Why do you need to change "RCFileOutputFormat"?
2. Have you tested this code?
3. In Utilities, why is tmpPath still there together with taskTmpPath? Also, there are 2 toTaskTempPath(); keep one.
4. CreateTmpDirs should start with a lower-case letter. CreateTmpDirs also uses 2 loops; use values(). Why is this even needed?
5. I think FileSinkOperator is not the only place that has this logic; you should change them all.
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: HIVE-2201.3.patch
According to Hairong Kuang, Hadoop's behavior when creating a new file is to automatically create its parent directory if it doesn't exist. Given that, I removed the directory check-and-create step when writing to a new file.
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: HIVE-2201.1.patch
Implemented the logic.
Discovered one problem: when moving from /tmp1/_tmp_1 to /tmp2/1, we might need to check whether /tmp2 exists before the move. This patch avoids that call by pre-creating the temp directory before submitting the job. However, we cannot do that for dynamic partitioning, as we don't know the directory names in advance. So for dynamic partitioning, some extra cost is added for DFS namenode reads. So far I think this tradeoff is worthwhile. Potentially this cost can be reduced by caching the directories created. We can try that approach as a followup.
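The directory-caching followup mentioned above could look roughly like
the sketch below. It is an illustration under stated assumptions, not
what any of the patches do: it runs against the local filesystem with
Python's os module rather than the Hadoop FileSystem API, the names
DirCache, ensure, fs_checks and move_finished are hypothetical, and the
fs_checks counter is only a stand-in for namenode round trips.

```python
import os
import tempfile


class DirCache:
    """Remembers directories already created so each is checked only once."""

    def __init__(self):
        self.known = set()
        self.fs_checks = 0  # stand-in for namenode round trips

    def ensure(self, path):
        """Create path if this process has not seen it before."""
        if path in self.known:
            return  # cache hit: no filesystem call at all
        self.fs_checks += 1
        os.makedirs(path, exist_ok=True)
        self.known.add(path)


def move_finished(cache, tmp_file, final_dir, name):
    """Rename a finished temp file into final_dir, creating it on first use."""
    cache.ensure(final_dir)
    os.rename(tmp_file, os.path.join(final_dir, name))
```

With such a cache, a task that writes many files into the same
dynamic-partition directory would pay the existence check once per
directory instead of once per file. The cache is per process, so it does
not help across separate tasks.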
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054200#comment-13054200 ]
He Yongqiang commented on HIVE-2201:
------------------------------------
i will take a look...
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Assignee: Siying Dong
Summary: reduce name node calls in hive by creating temporary directories (was: remove name node calls in hive by creating temporary directories)
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-2201:
-------------------------------
Status: Open (was: Patch Available)
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Status: In Progress (was: Patch Available)
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068787#comment-13068787 ]
He Yongqiang commented on HIVE-2201:
------------------------------------
+1, will commit after tests pass.
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054584#comment-13054584 ]
jiraposter@reviews.apache.org commented on HIVE-2201:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/952/
-----------------------------------------------------------
Review request for hive, Yongqiang He, Ning Zhang, and namit jain.
Summary
-------
reduce name node calls in hive by creating temporary directories
This addresses bug HIVE-2201.
https://issues.apache.org/jira/browse/HIVE-2201
Diffs
-----
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/ExecDriver.java 1134223
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java 1134223
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 1134223
trunk/ql/src/java/org/apache/hadoop/hive/ql/io/RCFileOutputFormat.java 1134223
Diff: https://reviews.apache.org/r/952/diff
Testing
-------
Thanks,
Siying
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
> Currently, in Hive, when a file gets written by a FileSinkOperator,
> the sequence of operations is as follows:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp1/1
> 3. Move directory /tmp1 to /tmp2
> 4. For all files in /tmp2, remove all files starting with _tmp and
> duplicate files.
> Due to speculative execution, a lot of temporary files are created
> in /tmp1 (or /tmp2). This leads to a lot of name node calls,
> specially for large queries.
> The protocol above can be modified slightly:
> 1. In tmp directory tmp1, create a tmp file _tmp_1
> 2. At the end of the operator, move
> /tmp1/_tmp_1 to /tmp2/1
> 3. Move directory /tmp2 to /tmp3
> 4. For all files in /tmp3, remove all duplicate files.
> This should reduce the number of tmp files.
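The modified protocol quoted above can be sketched roughly as follows. This is only an illustration on the local filesystem using java.nio.file, not the HIVE-2201 patch and not the Hadoop FileSystem API. The class name, the method names, and the "taskId_attemptId" file naming used to detect duplicates are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the modified protocol; names are illustrative only.
final class TmpDirProtocolSketch {

    // Steps 1 and 2: write under a _tmp_ name in tmp1, then move the
    // finished file into tmp2 under its final name.
    static Path writeAndCommit(Path tmp1, Path tmp2, String name, String data)
            throws IOException {
        Files.createDirectories(tmp1);
        Files.createDirectories(tmp2);
        Path inProgress = tmp1.resolve("_tmp_" + name);
        Files.write(inProgress, data.getBytes(StandardCharsets.UTF_8));
        return Files.move(inProgress, tmp2.resolve(name));
    }

    // Step 4: keep one file per task and delete the rest. Assumes files are
    // named "<taskId>_<attemptId>"; which attempt to keep is arbitrary here
    // (the first in sorted order).
    static List<String> removeDuplicates(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.sorted().collect(Collectors.toList());
        }
        Set<String> seenTasks = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (Path file : files) {
            String fileName = file.getFileName().toString();
            int sep = fileName.indexOf('_');
            String taskId = sep < 0 ? fileName : fileName.substring(0, sep);
            if (seenTasks.add(taskId)) {
                kept.add(fileName);
            } else {
                Files.delete(file);
            }
        }
        return kept;
    }

    // Runs all four steps in a fresh temporary directory.
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("hive2201-sketch");
            Path tmp1 = root.resolve("tmp1");
            Path tmp2 = root.resolve("tmp2");
            Path tmp3 = root.resolve("tmp3");
            writeAndCommit(tmp1, tmp2, "000000_0", "rows");
            writeAndCommit(tmp1, tmp2, "000000_1", "rows"); // speculative attempt
            writeAndCommit(tmp1, tmp2, "000001_0", "rows");
            // A killed attempt leaves its _tmp_ file behind in tmp1 only.
            Files.write(tmp1.resolve("_tmp_000001_1"), new byte[0]);
            Files.move(tmp2, tmp3); // step 3: move the whole directory
            return removeDuplicates(tmp3);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because unfinished _tmp_ files stay behind in tmp1, the final pass over tmp3 only has to look for duplicates and never has to filter out temporary files.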
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054188#comment-13054188 ]
Siying Dong commented on HIVE-2201:
-----------------------------------
ping
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
[jira] [Commented] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054595#comment-13054595 ]
Siying Dong commented on HIVE-2201:
-----------------------------------
Yongqiang:
1. As I commented previously: "According to Hairong Kuang, Hadoop's behavior for creating a new file is that it will automatically create its parent directory if it doesn't exist. In that case, I removed the directory check-and-create part when writing to a new file."
2. I tested the code. I ran the whole regression test suite and tested several cases manually on the cluster. I also tried killing some tasks manually.
3. I'll see whether there is another dependency so that I can remove the old one. Having two overloaded calls is the convention we have in the file: all other similar calls have one function taking a Path and one taking a String.
4. The tree traversal logic is copied from localizeMRTmpFilesImpl(). The first loop goes through every operator tree. The second loop does a breadth-first search of the operator tree to check for any FileSinkOperator.
5. OK. I'll make the change. My understanding is that only FileSinkOperator and the BlockMerge file sink have the problem, and the second one is going to see some large changes in HIVE-2035. Also, the BlockMerge file sink suffers from the problem less, as it runs faster and so has less chance of producing incomplete results.
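The two-loop traversal described in point 4 might look roughly like the sketch below. The Op and FileSinkOp classes are hypothetical stand-ins, not Hive's real Operator or FileSinkOperator classes, and this is not the code from localizeMRTmpFilesImpl(); it only shows an outer loop over the operator trees with a breadth-first search inside each one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; Op and FileSinkOp are illustrative stand-ins only.
final class FileSinkSearchSketch {

    static class Op {
        final String name;
        final List<Op> children = new ArrayList<>();

        Op(String name, Op... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }
    }

    static class FileSinkOp extends Op {
        FileSinkOp(String name) {
            super(name);
        }
    }

    static List<FileSinkOp> findFileSinks(Collection<Op> roots) {
        List<FileSinkOp> found = new ArrayList<>();
        // Identity-based visited set, in case operators are shared between trees.
        Set<Op> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Op root : roots) {              // first loop: every operator tree
            Deque<Op> queue = new ArrayDeque<>();
            queue.add(root);
            while (!queue.isEmpty()) {       // second loop: breadth-first search
                Op op = queue.poll();
                if (!visited.add(op)) {
                    continue;                // already handled this operator
                }
                if (op instanceof FileSinkOp) {
                    found.add((FileSinkOp) op);
                }
                queue.addAll(op.children);
            }
        }
        return found;
    }

    // Builds two small trees and returns the names of the file sinks found.
    static List<String> demoNames() {
        Op tree1 = new Op("TS-1", new Op("SEL-1", new FileSinkOp("FS-1")));
        Op tree2 = new Op("TS-2",
                new Op("FIL-1",
                        new FileSinkOp("FS-2"),
                        new Op("SEL-2", new FileSinkOp("FS-3"))));
        List<String> names = new ArrayList<>();
        for (FileSinkOp sink : findFileSinks(Arrays.asList(tree1, tree2))) {
            names.add(sink.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demoNames());
    }
}
```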
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: HIVE-2201.1.patch
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch
>
>
[jira] [Work started] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HIVE-2201 started by Siying Dong.
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch
>
>
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Status: Patch Available (was: In Progress)
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch
>
>
[jira] [Updated] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "Siying Dong (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siying Dong updated HIVE-2201:
------------------------------
Attachment: (was: HIVE-2201.1.patch)
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch
>
>
[jira] [Resolved] (HIVE-2201) reduce name node calls in hive by
creating temporary directories
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-2201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang resolved HIVE-2201.
--------------------------------
Resolution: Fixed
committed, thanks Siying!
> reduce name node calls in hive by creating temporary directories
> ----------------------------------------------------------------
>
> Key: HIVE-2201
> URL: https://issues.apache.org/jira/browse/HIVE-2201
> Project: Hive
> Issue Type: Improvement
> Reporter: Namit Jain
> Assignee: Siying Dong
> Attachments: HIVE-2201.1.patch, HIVE-2201.2.patch, HIVE-2201.3.patch, HIVE-2201.4.patch
>
>