You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2010/08/06 07:57:16 UTC

[jira] Created: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

archive is not working when multiple partitions inside one table are archived.
------------------------------------------------------------------------------

                 Key: HIVE-1515
                 URL: https://issues.apache.org/jira/browse/HIVE-1515
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: He Yongqiang


set hive.exec.compress.output = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=256;
set mapred.min.split.size.per.node=256;
set mapred.min.split.size.per.rack=256;
set mapred.max.split.size=256;

set hive.archive.enabled = true;

drop table combine_3_srcpart_seq_rc;

create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;

insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;

insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;

ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");

select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;

drop table combine_3_srcpart_seq_rc;


will fail.

java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

The reason it fails is because:
there are 2 input paths (one for each partition) for the above query:
1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har

The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897537#action_12897537 ] 

HBase Review Board commented on HIVE-1515:
------------------------------------------

Message from: "Paul Yang" <py...@facebook.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/598/#review853
-----------------------------------------------------------


Talked to Yongqiang offline about this one. The way that this patch attempts to fix the caching issue is to append some path information to the host so that we create a new HAR filesystem instance for different HAR files. The way that this is implemented now, a "-" and path information in added to the host e.g. har://hdfs-localhost-user--warehouse--mytable:50030... if the original were har://hdfs-localhost:50030. However, the HAR filesystem does not ignore the stuff after the second "-" and so has errors when trying to connect to the underlying filesystem. A possible fix would be to modify HiveHarFileSystem to extend the initialize() method so that the characters after the second "-" is ignored.

- Paul





> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

    Attachment: hive-1515.2.patch

Attache a possible fix.

Talked with Namit and Paul this afternoon about this issue. Actually there is config which can disable FileSystem cache: fs.%s.impl.disable.cache . where %s is the filesystem schema, for archive, it's har.

So if you set "fs.har.impl.disable.cache" to false, the archive will automatically work. This should be the clean way to fix this issue.
In order to do this, you need to apply https://issues.apache.org/jira/browse/HADOOP-6231 if your hadoop does not include the code to disable FileSystem cache.

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch, hive-1515.2.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896755#action_12896755 ] 

HBase Review Board commented on HIVE-1515:
------------------------------------------

Message from: "Yongqiang He" <he...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/598/
-----------------------------------------------------------

Review request for Hive Developers.


Summary
-------

archive is not working when multiple partitions inside one table are archived.


This addresses bug hive-1515.
    http://issues.apache.org/jira/browse/hive-1515


Diffs
-----

  trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java 982490 
  trunk/ql/src/test/queries/clientpositive/archive_2.q PRE-CREATION 
  trunk/ql/src/test/results/clientpositive/archive_2.q.out PRE-CREATION 

Diff: http://review.cloudera.org/r/598/diff


Testing
-------


Thanks,

Yongqiang




> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

    Attachment: hive-1515.2.patch

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch, hive-1515.2.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

               Status: Patch Available  (was: Open)
    Affects Version/s: 0.7.0

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

    Attachment:     (was: hive-1515.2.patch)

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

    Attachment: hive-1515.1.patch

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1515:
-------------------------------

    Assignee:     (was: He Yongqiang)

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>         Attachments: hive-1515.1.patch, hive-1515.2.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang reassigned HIVE-1515:
----------------------------------

    Assignee: He Yongqiang

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1515:
----------------------------

    Status: Open  (was: Patch Available)

See comments on reviewboard.

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-1515
>                 URL: https://issues.apache.org/jira/browse/HIVE-1515
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch
>
>
> set hive.exec.compress.output = true;
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int , value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> will fail.
> java.io.IOException: Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is because:
> there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when doing path.getFileSystem() for these 2 input paths. they both return same one file system instance which points the first caller, in this case which is har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason here is Hadoop's FileSystem has a global cache, and when trying to load a FileSystem instance from a given path, it only take the path's scheme and username to lookup the cache. So when we do Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.