Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/07/29 02:48:19 UTC
[jira] Created: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
FileSinkOperator should remove duplicated files from the same task based on file sizes
--------------------------------------------------------------------------------------
Key: HIVE-1492
URL: https://issues.apache.org/jira/browse/HIVE-1492
Project: Hadoop Hive
Issue Type: Bug
Reporter: Ning Zhang
FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task.
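The selection rule described above can be sketched as follows. This is a minimal illustration in Python rather than Hive's Java, and it is not the code in the attached patch. The file-name pattern (a task ID, then `_`, then an attempt number), the `task_id_of` helper, and the `(name, size)` input shape are all assumptions made for the example.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical naming scheme: "<taskId>_<attemptNumber>", e.g. "000001_0".
# The real layout of Hive's temporary output files may differ.
_ATTEMPT_NAME = re.compile(r"^(?P<task>\d+)_(?P<attempt>\d+)$")


def task_id_of(file_name: str) -> str:
    """Return the task ID encoded in an attempt file name (assumed format)."""
    match = _ATTEMPT_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unrecognized attempt file name: {file_name!r}")
    return match.group("task")


def pick_largest_per_task(
    files: Iterable[Tuple[str, int]]
) -> Tuple[List[str], List[str]]:
    """Split (name, size) pairs into files to keep and files to delete.

    For each task the largest file is kept; on a size tie the file seen
    first is kept. Every other attempt of that task is marked for deletion.
    """
    best: Dict[str, Tuple[str, int]] = {}
    to_delete: List[str] = []
    for name, size in files:
        task = task_id_of(name)
        current = best.get(task)
        if current is None:
            best[task] = (name, size)
        elif size > current[1]:
            # A larger attempt replaces the one retained so far.
            to_delete.append(current[0])
            best[task] = (name, size)
        else:
            to_delete.append(name)
    kept = [name for name, _ in best.values()]
    return kept, to_delete
```

Under a first-file rule, a truncated attempt that happened to be listed first would be retained; with the rule sketched here, a larger file from another attempt of the same task replaces it.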
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
RE: [jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by Siying Dong <si...@facebook.com>.
Larger files are not guaranteed to be the right ones. (For example, user-defined transform scripts can freely access external resources and generate output over which we have no control.) But the largest file, rather than the first one, is much more likely to be the correct one. Until we use the new MapReduce API to fix the issue of wrong results being generated in MapReduce, this patch will help us fix the problem in most scenarios.
-----Original Message-----
From: He Yongqiang (JIRA) [mailto:jira@apache.org]
Sent: Thursday, July 29, 2010 12:12 PM
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782 ]
He Yongqiang commented on HIVE-1492:
------------------------------------
MapReduce assumes that, given the same input and the same map/reduce functions, the output is always the same.
Otherwise the MapReduce fault-tolerance mechanism would be incorrect.
[jira] Assigned: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Zhang reassigned HIVE-1492:
--------------------------------
Assignee: Ning Zhang
[jira] Reopened: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namit Jain reopened HIVE-1492:
------------------------------
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Zhang updated HIVE-1492:
-----------------------------
Attachment: HIVE-1492_branch-0.6.patch
Uploading a patch for branch-0.6.
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893460#action_12893460 ]
He Yongqiang commented on HIVE-1492:
------------------------------------
+1, looks good. Will commit after tests pass.
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893786#action_12893786 ]
He Yongqiang commented on HIVE-1492:
------------------------------------
Running tests on branch-0.6.
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-1492:
-------------------------------
Status: Resolved (was: Patch Available)
Fix Version/s: 0.7.0
Resolution: Fixed
I just committed. Thanks Ning!
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782 ]
He Yongqiang commented on HIVE-1492:
------------------------------------
MapReduce assumes that, given the same input and the same map/reduce functions, the output is always the same.
Otherwise the MapReduce fault-tolerance mechanism would be incorrect.
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Carl Steinbach updated HIVE-1492:
---------------------------------
Fix Version/s: (was: 0.7.0)
Affects Version/s: (was: 0.7.0)
Component/s: Query Processor
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893772#action_12893772 ]
Edward Capriolo commented on HIVE-1492:
---------------------------------------
"the largest file is the correct file"
Is that generally true or an absolute fact?
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893907#action_12893907 ]
He Yongqiang commented on HIVE-1492:
------------------------------------
Committed to branch-0.6 as well. Thanks John!
[jira] Resolved: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Namit Jain resolved HIVE-1492.
------------------------------
Resolution: Fixed
Let us fix it in the follow-up.
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893788#action_12893788 ]
Ning Zhang commented on HIVE-1492:
----------------------------------
@Edward, this is a heuristic that should be generally true. The good news is that we are not aware of any exceptions that violate the rule (assuming multiple attempts of the same task give deterministic results).
The reason we rely on a heuristic here is that the old Hadoop API does not support exception handling outside the Mapper's map() function. The bug appears when an exception is thrown in Hadoop's RecordReader layer and is not passed on to the Mapper. When mapper.close() is called, there is no way for the mapper to know whether an exception occurred in the Hadoop code path. A better way to handle this is to use the new Hadoop API, which gives more control to the application layer. This heuristic is a workaround based on the old Hadoop API.
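The failure mode described in this comment can be illustrated with a toy model. This is a sketch in Python under stated assumptions, not Hadoop or Hive code: `FlakyReader`, `SketchMapper`, and `run_attempt` are hypothetical stand-ins, and the driver loop only approximates how a framework might call `map()` and `close()`.

```python
class FlakyReader:
    """Stand-in for a RecordReader that may fail partway through its input."""

    def __init__(self, records, fail_at=None):
        self._records = list(records)
        self._fail_at = fail_at  # index at which next() raises, or None
        self._pos = 0

    def next(self):
        if self._pos == self._fail_at:
            raise IOError("simulated read failure")
        if self._pos >= len(self._records):
            return None  # end of input
        record = self._records[self._pos]
        self._pos += 1
        return record


class SketchMapper:
    """Stand-in mapper that 'commits' its rows when closed."""

    def __init__(self):
        self.rows = []
        self.committed = None

    def map(self, record):
        self.rows.append(record)

    def close(self):
        # Nothing has told the mapper about a reader failure, so it
        # commits whatever it has seen so far.
        self.committed = list(self.rows)


def run_attempt(reader, mapper):
    """Toy driver: the read error is raised outside map(), yet close() runs."""
    try:
        while True:
            record = reader.next()
            if record is None:
                break
            mapper.map(record)
    finally:
        mapper.close()
```

In this model a failed attempt still leaves behind a committed but truncated output, which is smaller than the output of a complete attempt over the same input. That size difference is what the heuristic relies on.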
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Zhang updated HIVE-1492:
-----------------------------
Status: Patch Available (was: Open)
Affects Version/s: 0.7.0
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899152#action_12899152 ]
Ning Zhang commented on HIVE-1492:
----------------------------------
Agreed that we should catch the exception in (Combine)HiveRecordReader, but those readers are used only on the map side. In the reducer, no RecordReader is called, and there could also be exceptions outside the reducer's reduce() function. This fix catches that case as well.
I've filed another JIRA, HIVE-1543, for catching exceptions in RecordReaders.
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-1492:
-------------------------------
Fix Version/s: 0.6.0
[jira] Updated: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ning Zhang updated HIVE-1492:
-----------------------------
Attachment: HIVE-1492.patch
[jira] Commented: (HIVE-1492) FileSinkOperator should remove
duplicated files from the same task based on file sizes
Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899048#action_12899048 ]
Namit Jain commented on HIVE-1492:
----------------------------------
A better fix would be to catch exceptions thrown by next() in HiveRecordReader, CombineHiveRecordReader, etc., and set the abort flag in ExecMapper when an exception occurs.
In that case there will be exactly one successful mapper.
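The proposal above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Hive implementation: `FailingReader`, `AbortAwareReader`, `AbortableMapper`, and `run_attempt` are hypothetical names, and the `abort` attribute only stands in for the abort flag mentioned for ExecMapper.

```python
class FailingReader:
    """Stand-in reader that raises after yielding a fixed number of records."""

    def __init__(self, records, fail_after=None):
        self._records = list(records)
        self._fail_after = fail_after  # None means never fail
        self._pos = 0

    def next(self):
        if self._fail_after is not None and self._pos >= self._fail_after:
            raise IOError("simulated read failure")
        if self._pos >= len(self._records):
            return None  # end of input
        record = self._records[self._pos]
        self._pos += 1
        return record


class AbortableMapper:
    """Stand-in mapper that refuses to commit once its abort flag is set."""

    def __init__(self):
        self.abort = False
        self.rows = []
        self.committed = None

    def map(self, record):
        self.rows.append(record)

    def close(self):
        # Commit output only if no failure was reported.
        self.committed = None if self.abort else list(self.rows)


class AbortAwareReader:
    """Wraps a reader and sets the mapper's abort flag if next() fails."""

    def __init__(self, inner, mapper):
        self._inner = inner
        self._mapper = mapper

    def next(self):
        try:
            return self._inner.next()
        except Exception:
            self._mapper.abort = True
            raise  # let the framework still see the failure


def run_attempt(reader, mapper):
    """Toy driver that always calls close(), even after a read error."""
    try:
        while True:
            record = reader.next()
            if record is None:
                break
            mapper.map(record)
    finally:
        mapper.close()
```

With the flag in place, a failed attempt leaves no committed output in this model, so no choice among duplicate files is needed on the map side.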