
Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/07/29 02:48:19 UTC

[jira] Created: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

FileSinkOperator should remove duplicated files from the same task based on file sizes
--------------------------------------------------------------------------------------

                 Key: HIVE-1492
                 URL: https://issues.apache.org/jira/browse/HIVE-1492
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: Ning Zhang


FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 
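
The selection rule described above can be sketched in a few lines of Java. This is only an illustration of "keep the largest file per task", not the code in Utilities.removeTempOrDuplicateFiles() or in the attached patch. The OutputFile class, the retainLargestPerTask() helper, and the assumed "<taskId>_<attemptId>" file naming are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LargestFilePerTask {

    /** Hypothetical stand-in for a directory listing entry: file name plus length in bytes. */
    public static final class OutputFile {
        public final String name;
        public final long length;

        public OutputFile(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    /**
     * Assumed naming scheme for this sketch: "<taskId>_<attemptId>", e.g. "000001_0".
     * The task id is everything before the last underscore.
     */
    static String taskIdOf(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    /** Keeps the largest file for each task id; on equal sizes the file seen first is kept. */
    public static List<OutputFile> retainLargestPerTask(List<OutputFile> files) {
        Map<String, OutputFile> best = new LinkedHashMap<>();
        for (OutputFile f : files) {
            String taskId = taskIdOf(f.name);
            OutputFile current = best.get(taskId);
            if (current == null || f.length > current.length) {
                best.put(taskId, f);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<OutputFile> files = Arrays.asList(
                new OutputFile("000000_0", 100L),   // e.g. a failed or partial attempt
                new OutputFile("000000_1", 250L),   // second attempt of the same task
                new OutputFile("000001_0", 80L));
        for (OutputFile f : retainLargestPerTask(files)) {
            System.out.println(f.name + " (" + f.length + " bytes)");
        }
    }
}
```

In this sketch the smaller file of task 000000 would be the one to delete. Size is only a heuristic for picking the complete attempt, so the sketch says nothing about whether the kept file is actually correct.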

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

RE: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by Siying Dong <si...@facebook.com>.

Larger files are not guaranteed to be the right ones. (For example, user-defined transform scripts can freely access external resources and generate output over which we have no control.) But the largest file, rather than the first one, is much more likely to be the correct one. Until we use the new MapReduce API to fix the issue of generating wrong results in MapReduce, this patch will help us fix the problem in most scenarios.

-----Original Message-----
From: He Yongqiang (JIRA) [mailto:jira@apache.org] 
Sent: Thursday, July 29, 2010 12:12 PM
To: hive-dev@hadoop.apache.org
Subject: [jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes


    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

The assumption of MapReduce is that,
given the same input and the same m/r function, the output should always be the same.

Otherwise the map-reduce fault tolerance mechanism is wrong.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Assigned: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang reassigned HIVE-1492:
--------------------------------

    Assignee: Ning Zhang

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Reopened: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reopened HIVE-1492:
------------------------------


> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1492:
-----------------------------

    Attachment: HIVE-1492_branch-0.6.patch

Uploading a patch for branch-0.6.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893460#action_12893460 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

+1, looks good. Will commit after tests pass.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893786#action_12893786 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

Running tests on branch-0.6.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1492:
-------------------------------

           Status: Resolved  (was: Patch Available)
    Fix Version/s: 0.7.0
       Resolution: Fixed

I just committed. Thanks, Ning!

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893782#action_12893782 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

The assumption of MapReduce is that,
given the same input and the same m/r function, the output should always be the same.

Otherwise the map-reduce fault tolerance mechanism is wrong.

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1492:
---------------------------------

        Fix Version/s:     (was: 0.7.0)
    Affects Version/s:     (was: 0.7.0)
          Component/s: Query Processor

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Edward Capriolo (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893772#action_12893772 ] 

Edward Capriolo commented on HIVE-1492:
---------------------------------------

"the largest file is the correct file" 
Is that generally true or an absolute fact?

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893907#action_12893907 ] 

He Yongqiang commented on HIVE-1492:
------------------------------------

committed to branch-0.6 as well. Thanks John!

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Resolved: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain resolved HIVE-1492.
------------------------------

    Resolution: Fixed

Let us fix it in the follow-up 

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893788#action_12893788 ] 

Ning Zhang commented on HIVE-1492:
----------------------------------

@Edward, this is a heuristic that should be generally true. The good news is that we are not aware of any exceptions that violate the rule (assuming multiple attempts of the same task give deterministic results). 

The reason we are relying on a heuristic here is that the old Hadoop API does not support exception handling outside the Mapper's map() function. The bug appears if an exception is thrown in Hadoop's RecordReader layer and is not passed on to the Mapper. When mapper.close() is called, there is no way for the mapper to know whether an exception happened in the Hadoop code path. A better way to handle this is to use the new Hadoop API, which gives more control to the application layer. This heuristic is a workaround based on the old Hadoop API. 


> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1492:
-----------------------------

               Status: Patch Available  (was: Open)
    Affects Version/s: 0.7.0

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>         Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899152#action_12899152 ] 

Ning Zhang commented on HIVE-1492:
----------------------------------

Agreed that we should catch the exception in (Combine)HiveRecordReader, but those are only used on the map side. In the reducer, the RecordReader is not called, and there could also be exceptions outside of reducer(). This fix catches that case as well.

I've filed another JIRA, HIVE-1543, for catching exceptions in RecordReaders. 



> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1492:
-------------------------------

    Fix Version/s: 0.6.0

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Updated: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1492:
-----------------------------

    Attachment: HIVE-1492.patch

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>         Attachments: HIVE-1492.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 


[jira] Commented: (HIVE-1492) FileSinkOperator should remove duplicated files from the same task based on file sizes

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899048#action_12899048 ] 

Namit Jain commented on HIVE-1492:
----------------------------------

A better fix would be to catch exceptions thrown by next() in HiveRecordReader/CombineHiveRecordReader etc. and set the abort flag in ExecMapper in case of an exception.
There will be exactly 1 successful mapper in that case.
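
The idea in this comment can be sketched roughly as follows. This is a minimal, self-contained illustration and not Hive or Hadoop code: SimpleReader, TaskState, AbortFlaggingReader and shouldCommitOutput() are hypothetical names standing in for the record reader, the abort flag in ExecMapper, and the close-time decision. The sketch assumes that a failed read should mark the attempt as aborted so its output is never kept.

```java
import java.io.IOException;

public class AbortOnReadFailure {

    /** Hypothetical minimal reader interface; this is not Hadoop's RecordReader. */
    public interface SimpleReader {
        /** Returns the next record, or null at end of input. */
        String next() throws IOException;
    }

    /** Hypothetical stand-in for the abort flag kept by the mapper. */
    public static final class TaskState {
        private volatile boolean abort = false;

        public void setAbort(boolean abort) {
            this.abort = abort;
        }

        public boolean isAbort() {
            return abort;
        }
    }

    /** Wraps a reader so that any failure in next() sets the abort flag before rethrowing. */
    public static final class AbortFlaggingReader implements SimpleReader {
        private final SimpleReader delegate;
        private final TaskState state;

        public AbortFlaggingReader(SimpleReader delegate, TaskState state) {
            this.delegate = delegate;
            this.state = state;
        }

        @Override
        public String next() throws IOException {
            try {
                return delegate.next();
            } catch (IOException | RuntimeException e) {
                // Record the failure so close-time logic can discard this attempt's output.
                state.setAbort(true);
                throw e;
            }
        }
    }

    /** Close-time decision in this sketch: keep output only if no read failure was seen. */
    public static boolean shouldCommitOutput(TaskState state) {
        return !state.isAbort();
    }
}
```

With a flag like this, an attempt that failed while reading would not leave a committed file behind, so choosing among duplicates by size would not be needed for that case. As noted elsewhere in the thread, this covers only the map side.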

> FileSinkOperator should remove duplicated files from the same task based on file sizes
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-1492
>                 URL: https://issues.apache.org/jira/browse/HIVE-1492
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0, 0.7.0
>
>         Attachments: HIVE-1492.patch, HIVE-1492_branch-0.6.patch
>
>
> FileSinkOperator.jobClose() calls Utilities.removeTempOrDuplicateFiles() to retain only one file for each task. A task could produce multiple files due to failed attempts or speculative runs. The largest file should be retained rather than the first file for each task. 
