Posted to dev@hive.apache.org by "Sammy Yu (JIRA)" <ji...@apache.org> on 2010/09/02 02:39:55 UTC

[jira] Created: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Using CombinedHiveInputFormat causes partToPartitionInfo IOException  
----------------------------------------------------------------------

                 Key: HIVE-1610
                 URL: https://issues.apache.org/jira/browse/HIVE-1610
             Project: Hadoop Hive
          Issue Type: Bug
         Environment: Hadoop 0.20.2
            Reporter: Sammy Yu


I have a relatively complicated Hive query using CombineHiveInputFormat:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true; 
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=300;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
INSERT OVERWRITE TABLE keyword_serp_results_no_dups PARTITION(week)
select distinct keywords.keyword, keywords.domain, keywords.url, keywords.rank, keywords.universal_rank,
    keywords.serp_type, keywords.date_indexed, keywords.search_engine_type, keywords.week
from keyword_serp_results keywords
JOIN (
    select domain, keyword, search_engine_type, week, max_date_indexed, min(rank) as best_rank
    from (
        select keywords1.domain, keywords1.keyword, keywords1.search_engine_type, keywords1.week,
            keywords1.rank, dupkeywords1.max_date_indexed
        from keyword_serp_results keywords1
        JOIN (
            select domain, keyword, search_engine_type, week, max(date_indexed) as max_date_indexed
            from keyword_serp_results
            group by domain, keyword, search_engine_type, week
        ) dupkeywords1
        on keywords1.keyword = dupkeywords1.keyword
            AND keywords1.domain = dupkeywords1.domain
            AND keywords1.search_engine_type = dupkeywords1.search_engine_type
            AND keywords1.week = dupkeywords1.week
            AND keywords1.date_indexed = dupkeywords1.max_date_indexed
    ) dupkeywords2
    group by domain, keyword, search_engine_type, week, max_date_indexed
) dupkeywords3
on keywords.keyword = dupkeywords3.keyword
    AND keywords.domain = dupkeywords3.domain
    AND keywords.search_engine_type = dupkeywords3.search_engine_type
    AND keywords.week = dupkeywords3.week
    AND keywords.date_indexed = dupkeywords3.max_date_indexed
    AND keywords.rank = dupkeywords3.best_rank;
 
This query used to work fine until I updated to r991183 on trunk, at which point I started getting this error:

java.io.IOException: cannot find dir = hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002/000000_0 in 
partToPartitionInfo: [hdfs://ec2-75-101-174-245.compute-1.amazonaws.com:8020/tmp/hive-root/hive_2010-09-01_10-57-41_396_1409145025949924904/-mr-10002,
hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=417/week=201035/day=20100829,
hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=418/week=201035/day=20100829,
hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=419/week=201035/day=20100829,
hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100829,
hdfs://ec2-75-101-174-245.compute-1.amazonaws.com/user/root/domain_keywords/account=422/week=201035/day=20100831]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:277)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat$CombineHiveInputSplit.<init>(CombineHiveInputFormat.java:100)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:312)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:610)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:120)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:108)

The query works if I don't override hive.input.format, i.e. if I leave out this setting:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

I've narrowed this issue down to the commit for HIVE-1510: if I back out the r987746 changeset, everything works as before.
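
To illustrate the mismatch behind the error above (a standalone sketch, not Hive code; the host name is made up): java.net.URI equality includes the authority, so a key stored with an explicit :8020 port never compares equal to the same location written without it, even though the paths are identical.

{noformat}
import java.net.URI;

public class PortMismatchDemo {
    public static void main(String[] args) {
        // The same directory, with and without the namenode port.
        URI withPort = URI.create("hdfs://namenode.example.com:8020/tmp/hive-root/-mr-10002");
        URI noPort   = URI.create("hdfs://namenode.example.com/tmp/hive-root/-mr-10002");

        System.out.println(withPort.equals(noPort));                      // false: authorities differ
        System.out.println(withPort.getPath().equals(noPort.getPath()));  // true: paths are identical
    }
}
{noformat}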


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905594#action_12905594 ] 

He Yongqiang commented on HIVE-1610:
------------------------------------

1. Just removing
{noformat}
&& (dir.toUri().getScheme() == null || dir.toUri().getScheme().trim()
            .equals(""))
{noformat}
will make things work.

2. You need to use svn (not git) to generate the patch.



[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated HIVE-1610:
---------------------------

    Attachment: 0003-HIVE-1610.patch



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905367#action_12905367 ] 

He Yongqiang commented on HIVE-1610:
------------------------------------

Yes, there is a bug in HiveFileFormatUtils.getPartitionDescFromPathRecursively:

{noformat}
    if (part == null
        && (dir.toUri().getScheme() == null || dir.toUri().getScheme().trim()
            .equals(""))) {
{noformat}

We need to remove 
{noformat}
&& (dir.toUri().getScheme() == null || dir.toUri().getScheme().trim()
            .equals(""))
{noformat}

Sammy, can you help post a fix? You can add a testcase in TestHiveFileFormatUtils.
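
A rough shape for such a testcase might be the following (a hypothetical sketch only; the key type of pathToPartitionInfo and the third argument of getPartitionDescFromPathRecursively are assumptions, not copied from the source tree):

{noformat}
// Hypothetical sketch -- not the real test file. The map is assumed to be
// keyed by path strings (as the error message suggests), and the third
// argument is assumed to be an optional cache map (null here).
public void testGetPartitionDescFromPathWithPort() throws IOException {
    Map<String, PartitionDesc> pathToPartitionInfo =
        new LinkedHashMap<String, PartitionDesc>();
    PartitionDesc part = new PartitionDesc();
    // The key carries an explicit port, as in the failing job above.
    pathToPartitionInfo.put(
        "hdfs://namenode:8020/tmp/hive-root/-mr-10002", part);

    // The split directory comes in without the port and should still resolve.
    PartitionDesc found = HiveFileFormatUtils.getPartitionDescFromPathRecursively(
        pathToPartitionInfo,
        new Path("hdfs://namenode/tmp/hive-root/-mr-10002/000000_0"),
        null);
    assertEquals(part, found);
}
{noformat}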



[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated HIVE-1610:
---------------------------

    Attachment: 0002-HIVE-1610.-Added-additional-schema-check-to-doGetPar.patch



[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated HIVE-1610:
---------------------------

    Status: Patch Available  (was: Open)

You will have to forgive my ignorance, since this is my first time looking at the Hive source code.

I can confirm that removing the additional scheme check works; however, the original test case in TestHiveFileFormatUtils now fails: the third assertion in the first group of testGetPartitionDescFromPathRecursively.

I suspect the root of the issue is that the keys of the pathToPartitionInfo table carry an extra :8020 port. With the changes, populateNewPartitionDesc removes everything but the path, which is why that case works. I am wondering whether the best approach is to make doGetPartitionDescFromPath aware of the scheme. I've attached a hack for this approach, along with an additional test case. Please note it doesn't know anything about default ports for a given scheme.
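
The heart of the hack is roughly this comparison (a simplified, standalone sketch of the idea, not the patch itself):

{noformat}
import java.net.URI;

// Sketch: match on the path always; compare scheme and host only when both
// sides actually carry one. Port differences are tolerated, but as noted
// above this knows nothing about default ports for a given scheme.
static boolean schemeAwareMatch(URI key, URI dir) {
    if (!key.getPath().equals(dir.getPath())) {
        return false;                     // paths must agree exactly
    }
    if (key.getScheme() != null && dir.getScheme() != null
            && !key.getScheme().equals(dir.getScheme())) {
        return false;                     // hdfs:// must not match file://
    }
    if (key.getHost() != null && dir.getHost() != null
            && !key.getHost().equals(dir.getHost())) {
        return false;                     // different namenodes must not match
    }
    return true;
}
{noformat}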




[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905635#action_12905635 ] 

Sammy Yu commented on HIVE-1610:
--------------------------------

Yongqiang, thanks for taking a look at this.

If I take out the URI scheme checks, the original TestHiveFileFormatUtils.testGetPartitionDescFromPathRecursively test case fails:

    [junit] Running org.apache.hadoop.hive.ql.io.TestHiveFileFormatUtils
    [junit] junit.framework.TestListener: tests to run: 2
    [junit] junit.framework.TestListener: startTest(testGetPartitionDescFromPathRecursively)
    [junit] junit.framework.TestListener: addFailure(testGetPartitionDescFromPathRecursively, hdfs:///tbl/par1/part2/part3 should return null expected:<true> but was:<false>)
    [junit] junit.framework.TestListener: endTest(testGetPartitionDescFromPathRecursively)
    [junit] junit.framework.TestListener: startTest(testGetPartitionDescFromPathWithPort)
    [junit] junit.framework.TestListener: endTest(testGetPartitionDescFromPathWithPort)
    [junit] Tests run: 2, Failures: 1, Errors: 0, Time elapsed: 0.091 sec
    [junit] Test org.apache.hadoop.hive.ql.io.TestHiveFileFormatUtils FAILED

hdfs:///tbl/par1/part2/part3 should not match any PartitionDesc since the path in the map is file:///tbl/par1/part2/part3.  I will attach the svn version of the patch shortly.
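
For reference, the situation that assertion guards against looks like this (standalone sketch): the two URIs agree on the path and differ only in scheme, so a path-only comparison would wrongly match them.

{noformat}
URI a = URI.create("hdfs:///tbl/par1/part2/part3");
URI b = URI.create("file:///tbl/par1/part2/part3");
System.out.println(a.getPath().equals(b.getPath()));      // true  -- path-only match succeeds
System.out.println(a.getScheme().equals(b.getScheme()));  // false -- "hdfs" vs "file"
{noformat}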







[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sammy Yu updated HIVE-1610:
---------------------------

    Attachment: 0004-hive.patch



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "Sammy Yu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906898#action_12906898 ] 

Sammy Yu commented on HIVE-1610:
--------------------------------

He, yes, that's what the original 0002 patch does (it adds an additional check to ignore the port, as well as a test case for it). I'm not sure why the port discrepancy is there in the first place. I'll regenerate the 0002 patch with svn against trunk@993445. Thanks!




[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905751#action_12905751 ] 

He Yongqiang commented on HIVE-1610:
------------------------------------

Sammy, the only change needed in TestHiveFileFormatUtils is to remove the URI scheme check (a one-line change).
You actually re-added some lines of code that were removed by HIVE-1510, and that is why the testcase fails.



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906837#action_12906837 ] 

He Yongqiang commented on HIVE-1610:
------------------------------------

Sammy, we cannot fix this issue by just removing the scheme check.
If the input URI's path part is the same as a partition's path but their schemes differ, we should still return null.

For your case, the main problem is the port, which is contained in the partitionDesc but not in the input path.

Is it possible to just ignore the port? I mean, is there a case where two different instances share the same address but use different ports?
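
Ignoring the port would amount to a null-safe comparison along these lines (a sketch of the implied rule, not a patch):

{noformat}
import java.net.URI;

// Sketch: two locations name the same instance if scheme, host, and path
// all agree; only the port is left out of the comparison.
static boolean sameExceptPort(URI a, URI b) {
    return eq(a.getScheme(), b.getScheme())
        && eq(a.getHost(), b.getHost())
        && a.getPath().equals(b.getPath());
}

static boolean eq(String x, String y) {
    return (x == null) ? (y == null) : x.equals(y);
}
{noformat}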



[jira] Commented: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907058#action_12907058 ] 

He Yongqiang commented on HIVE-1610:
------------------------------------

Sammy, there are mainly 2 problems. 
1) going over the map is not efficient, and 2) using startWith to do prefix match is a bug fixed in HIVE-1510.

Sammy, can you change the logic as follows:

right now, hive generates another pathToPartitionInfo map by removing the path's schema information, and put it in a cacheMap. 
We can keep the same logic but change the new pathToPartitionInfo map's value to be an array of PartitionDesc. 
And then we can just remove the schema check, and once we get a match, we go through the array of PartitionDesc to find the best one.

This also solves another problem: today, if two PartitionDescs have the same path part but different schemas, only one of them ends up in the new pathToPartitionInfo map.
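
A rough sketch of the map construction I have in mind (the names are illustrative, not the actual Hive code; Path.toUri().getPath() is one way to drop the schema and authority):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.plan.PartitionDesc;

static Map<String, List<PartitionDesc>> stripSchemas(
    Map<String, PartitionDesc> pathToPartitionInfo) {
  Map<String, List<PartitionDesc>> stripped =
      new HashMap<String, List<PartitionDesc>>();
  for (Map.Entry<String, PartitionDesc> e : pathToPartitionInfo.entrySet()) {
    // keep only the path part, dropping schema://address:port
    String key = new Path(e.getKey()).toUri().getPath();
    List<PartitionDesc> descs = stripped.get(key);
    if (descs == null) {
      descs = new ArrayList<PartitionDesc>();
      stripped.put(key, descs);
    }
    descs.add(e.getValue()); // no entry gets overwritten any more
  }
  return stripped;
}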

About how to go through the array of PartitionDesc to find the best one (sketched below):
if the array contains only one element, return array.get(0);
1) if the original input does not carry any schema information: if the array contains more than one element, report an error;
2) if the original input does carry schema information: a) if the array contains an exact match (one whose schema and port are also identical to the input's), return it; b) otherwise ignore the port part, keep the schema and address, and go through the array again.
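
In code, roughly (again only a sketch; getPartitionUri() is a placeholder for however we recover the full URI each PartitionDesc was registered under):

import java.io.IOException;
import java.net.URI;
import java.util.List;

import org.apache.hadoop.hive.ql.plan.PartitionDesc;

static PartitionDesc selectBestMatch(URI input, List<PartitionDesc> candidates)
    throws IOException {
  if (candidates.size() == 1) {
    return candidates.get(0);
  }
  if (input.getScheme() == null) {
    // 1) no schema information on the input: the match must be unique
    throw new IOException("ambiguous path " + input + ": "
        + candidates.size() + " candidates");
  }
  // 2a) prefer the exact match: schema, address and port all identical
  for (PartitionDesc desc : candidates) {
    if (input.equals(getPartitionUri(desc))) {
      return desc;
    }
  }
  // 2b) otherwise ignore the port but keep the schema and address
  for (PartitionDesc desc : candidates) {
    URI u = getPartitionUri(desc);
    if (input.getScheme().equals(u.getScheme())
        && input.getHost() != null && input.getHost().equals(u.getHost())) {
      return desc;
    }
  }
  throw new IOException("cannot find " + input + " in pathToPartitionInfo");
}

// placeholder only, so the sketch is complete: the real code would keep
// the original URI next to each PartitionDesc (e.g. as a pair in the list)
static URI getPartitionUri(PartitionDesc desc) {
  throw new UnsupportedOperationException("sketch only");
}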

what do you think?


[jira] Updated: (HIVE-1610) Using CombinedHiveInputFormat causes partToPartitionInfo IOException

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1610:
-------------------------------

    Status: Open  (was: Patch Available)
