You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2010/07/26 18:20:50 UTC

[jira] Created: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

CombineHiveInputFormat for hadoop-19 is broken
----------------------------------------------

                 Key: HIVE-1488
                 URL: https://issues.apache.org/jira/browse/HIVE-1488
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Joydeep Sen Sarma


I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:

getInputPathsShim():

      Path[] newPaths = new Path[paths.length];
      // remove file:                                                                                                                                          
      for (int pos = 0; pos < paths.length; pos++) {
        newPaths[pos] = new Path(paths[pos].toString().substring(5));
      }

since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892467#action_12892467 ] 

Joydeep Sen Sarma commented on HIVE-1488:
-----------------------------------------

all for getting rid of stuff.

however  my understanding of the background is a little different. multifileinputformat can combine stuff inside a single dir - but does not do so based on locality. that was the biggest difference between CFIF and MFIF. Also - hive doesn't combine stuff across partitions (at least that has been my observation - would be happy to be corrected). so not sure that difference matters.

but given that no one uses it and the stuff is so obviously broke - i don't understand what the point of spending time on dead code is. so +1 for deprecating/removing this. (hadoop-19 was also not a particularly popular release - neither FB or Yahoo used it).

> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ning Zhang
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892587#action_12892587 ] 

Ning Zhang commented on HIVE-1488:
----------------------------------

Both MultiFileInputFormat and CombineFileInputFormat (on which CombineHiveInputFormat is based) can combine multiple files into one split. In addition to the differences in locality, CFIF also provide the interface to define pools and implement filters so that you can define which files should be/should not be combined into one split. In CHIF, the logics of putting multiple files in one directory (but not in different directories) in one split is implemented in CombineHiveInputFormat.CombineFilter.

It seems the MFIF support in hadoop 0.19 was added based on some external use cases in the hive-user mailing list (http://osdir.com/ml/hive-user-hadoop-apache/2010-02/msg00007.html). I'm not sure whether anyone is still actively using it though. 


> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ning Zhang
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892388#action_12892388 ] 

Ning Zhang commented on HIVE-1488:
----------------------------------

It should be removed. The corresponding code in shims 0.20 was removed by fixing HIVE-1317, but we didn't realized the same code was in 0.19. 


> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892426#action_12892426 ] 

Ning Zhang commented on HIVE-1488:
----------------------------------

MultiFileInputFormat in Hadoop 0.19 was introduced in HIVE-1121 to implement CombineFileInputFormat-like functionality in Hadoop pre-20 version.  However, MultiFileInputFormat actually does not support pooling (creating different splits for files in different directories). So it is impossible to use CombineHiveInputFormat to query multiple partitions (each partition has to be in a different pool in the case of CombineFileInputFormat). 

Due to this limitation and there is no easy way to fix this in Hive, I think we should disable CombineHiveInputFormat in pre-0.20 in strict mode and give a warning to users in nonstrict mode. 

For unit testing, we can exclude combine2.q (which test CombineHiveInputFormat across partitions) from Hadoop 0.19.

Any thoughts?

> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ning Zhang
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joydeep Sen Sarma reassigned HIVE-1488:
---------------------------------------

    Assignee: Ning Zhang

> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ning Zhang
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1488) CombineHiveInputFormat for hadoop-19 is broken

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12892411#action_12892411 ] 

Ning Zhang commented on HIVE-1488:
----------------------------------

Tried to simply remove the code from Hadoop19Shims.java and it still gives diffs in combine2.q. CombineHiveInputFormat in hadoop 19 extends MultiFileInputFormat while it extends CombineFileInputFormat in hadoop 20. So there may be some differences in 19. Will look into it in more details. 

> CombineHiveInputFormat for hadoop-19 is broken
> ----------------------------------------------
>
>                 Key: HIVE-1488
>                 URL: https://issues.apache.org/jira/browse/HIVE-1488
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Joydeep Sen Sarma
>            Assignee: Ning Zhang
>
> I don't if anyone is using it. After making some recent testing related changes in HIVE-1408, combine[12].q are no longer working when testing against 19. I have seen them fail earlier as well and not investigated. Looking at the code, it seems pretty hokey:
> getInputPathsShim():
>       Path[] newPaths = new Path[paths.length];
>       // remove file:                                                                                                                                          
>       for (int pos = 0; pos < paths.length; pos++) {
>         newPaths[pos] = new Path(paths[pos].toString().substring(5));
>       }
> since we are no longer using 'file:' namespace for test warehouse, this is broke. But this would be broken against any hdfs instance it would seem(?). Also not clear what we are trying to do here.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.