Posted to dev@hive.apache.org by "Ning Zhang (JIRA)" <ji...@apache.org> on 2010/05/21 03:06:16 UTC

[jira] Created: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

CombineHiveInputSplit should initialize the inputFileFormat once for a single split
-----------------------------------------------------------------------------------

                 Key: HIVE-1357
                 URL: https://issues.apache.org/jira/browse/HIVE-1357
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Ning Zhang
            Assignee: Ning Zhang


If a split consists of multiple files, the FileFormat should always be the same, whether RCFile or SequenceFile. Currently CombineHiveInputSplit tries to get the inputFileFormat for each new file in the split, and each lookup is O(n), where n is the number of files in the split. Overall this is an O(n^2) operation, and it degrades performance badly when combining a large number of small files.
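
To illustrate the cost difference described above, here is a minimal sketch. It is not the code in HIVE-1357.patch: the class, the method names and the linear-scan lookup are all hypothetical stand-ins, chosen only to show why resolving the format once per split does n times fewer lookups than resolving it per file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. All names here are hypothetical, not Hive's classes.
class SplitFormatLookup {
    // Number of format lookups performed so far.
    static long lookups = 0;

    // Hypothetical lookup: scans the split's files to find the format for one
    // path, so a single call is O(n) in the worst case.
    static String lookupFormat(List<String> paths, List<String> formats, String path) {
        lookups++;
        for (int i = 0; i < paths.size(); i++) {
            if (paths.get(i).equals(path)) {
                return formats.get(i);
            }
        }
        throw new IllegalArgumentException("unknown path: " + path);
    }

    // Per-file resolution: one O(n) lookup for each of the n files, O(n^2) overall.
    static List<String> resolvePerFile(List<String> paths, List<String> formats) {
        List<String> resolved = new ArrayList<>();
        for (String path : paths) {
            resolved.add(lookupFormat(paths, formats, path));
        }
        return resolved;
    }

    // Per-split resolution: assumes every file in the split shares one format,
    // so a single lookup is reused for all files, O(n) overall.
    static List<String> resolveOncePerSplit(List<String> paths, List<String> formats) {
        List<String> resolved = new ArrayList<>();
        if (paths.isEmpty()) {
            return resolved;
        }
        String format = lookupFormat(paths, formats, paths.get(0));
        for (int i = 0; i < paths.size(); i++) {
            resolved.add(format);
        }
        return resolved;
    }

    public static void main(String[] args) {
        List<String> paths = new ArrayList<>();
        List<String> formats = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            paths.add("/warehouse/t/part-" + i); // made-up file names
            formats.add("RCFile");
        }
        lookups = 0;
        resolvePerFile(paths, formats);
        System.out.println("per-file lookups:  " + lookups);
        lookups = 0;
        resolveOncePerSplit(paths, formats);
        System.out.println("per-split lookups: " + lookups);
    }
}
```

The per-split variant is only valid under the assumption stated in the description, namely that all files in one split have the same FileFormat.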

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869888#action_12869888 ] 

Namit Jain commented on HIVE-1357:
----------------------------------

+1

looks good

> CombineHiveInputSplit should initialize the inputFileFormat once for a single split
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-1357
>                 URL: https://issues.apache.org/jira/browse/HIVE-1357
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: HIVE-1357.patch
>
>
> If a split consists of multiple files, the FileFormat should always be the same, whether RCFile or SequenceFile. Currently CombineHiveInputSplit tries to get the inputFileFormat for each new file in the split, and each lookup is O(n), where n is the number of files in the split. Overall this is an O(n^2) operation, and it degrades performance badly when combining a large number of small files.



[jira] Updated: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1357:
-----------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed. Thanks Ning




[jira] Updated: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1357:
-----------------------------

    Status: Patch Available  (was: Open)




[jira] Updated: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1357:
---------------------------------

    Fix Version/s: 0.6.0
      Component/s: Query Processor
                   Serializers/Deserializers

> CombineHiveInputSplit should initialize the inputFileFormat once for a single split
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-1357
>                 URL: https://issues.apache.org/jira/browse/HIVE-1357
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1357.patch
>
>
> If a split consists of multiple files, the FileFormat should always be the same, whether RCFile or SequenceFile. Currently CombineHiveInputSplit tries to get the inputFileFormat for each new file in the split, and each lookup is O(n), where n is the number of files in the split. Overall this is an O(n^2) operation, and it degrades performance badly when combining a large number of small files.



[jira] Updated: (HIVE-1357) CombineHiveInputSplit should initialize the inputFileFormat once for a single split

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang updated HIVE-1357:
-----------------------------

    Attachment: HIVE-1357.patch

