You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Paul Yang (JIRA)" <ji...@apache.org> on 2010/04/30 20:47:03 UTC

[jira] Created: (HIVE-1332) Archiving partitions

Archiving partitions
--------------------

                 Key: HIVE-1332
                 URL: https://issues.apache.org/jira/browse/HIVE-1332
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Metastore
    Affects Versions: 0.6.0
            Reporter: Paul Yang
            Assignee: Paul Yang


Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.

One way to drastically reduce the number of files is to use hadoop archives:
http://hadoop.apache.org/common/docs/current/hadoop_archives.html

This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.

Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:

https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
https://issues.apache.org/jira/browse/HADOOP-6645
https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Status: Patch Available  (was: Open)

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Status: Patch Available  (was: Open)

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863465#action_12863465 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

The same problem is present during unarchive. Once the existence of a har file have been checked, 2 processes running concurrently
can create files. So, we may duplicate the data

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876770#action_12876770 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

Otherwise, it looks good

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-1332:
-----------------------------

          Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
      Resolution: Fixed

Committed. Thanks Paul

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876751#action_12876751 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Well, we can't quite test using CHIF because users will need to have the patch for MAPREDUCE-1806 (when it comes out), or else we will get an error.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.4.patch

This doesn't incorporate the hadoop-version aware test framework, but here's another version that has some additional fixes/tests.

* Handled table renaming
* Added a check when creating partitions to catch reserved values
* Additional tests for above

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.2.patch

* Adds negative test, changes way tests are skipped if the hadoop version is not supported
* Changes ordering of filesystem operations to prevent data loss / corruption, and allows recovery if interrupted by re-running command.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863570#action_12863570 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Checks that require access to the metadata should be done in DDLTask, not DDLSemanticAnalyzer, no?

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870861#action_12870861 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

	      if ("har".equalsIgnoreCase(dest_path.toUri().getScheme())) {
3257	        throw new SemanticException(ErrorMsg.OVERWRITE_ARCHIVED_PART
3258	            .getMsg());
3259	      }



Do you think it is a better idea to check if the partition is ARCHIVED instead ?

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.5.patch

Updated to current trunk.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863525#action_12863525 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Yeah, the way the patch is now, concurrent operations were not supported as it was assumed these commands were going to be run via a single cron job. But that is probably not a good assumption to make. And the priority was to make the order of operations to prevent data loss in case of any failures. The reason why the (un)archive operation is tricky to do concurrently is because there are no ways to lock a partition/table (HIVE-1293) and there are no ways to atomically make a filesystem and metadata change. But there are ways of addressing these concurrency issues while preserving data during failure scenarios:

Archiving a partition using a conservative approach would involve something like (as discussed with Namit):

1. Create a copy of the partition, call it ds=1.copy
2. Alter metadata's location to point to ds=1.copy
-- At this point failures are okay as the copy is not touched
3. Make the archive of the partition directory in a tmp directory
4. Remove the directory ds=1
5. Move the tmp directory to ds=1
6. Alter metadata's location to point to har:/...ds=1
7. Delete ds=1.copy

These set of steps would ensure that no matter when failure occurs, subsequent queries on the partition will continue to succeed. However, this approach incurs the overhead of having to make a copy of the partition, which can be significant. Another approach is to:

1. Make the archive of the partition in a tmp directory
2. Move the archive folder to ds=1.copy
3. Move ds=1 to ds=1.old
4. Move ds=1.copy to ds=1
5. Alter the metada to change the location to har:/...ds=1

The drawback to this approach is that if a failure occurs, subsequent queries will not be able to properly access the data. However, the archive command can be run again to recover from the situation.

Also since the semantics for FileSystem.rename() do not throw an error if the destination directory already exists, there is a small window for data duplication. However, this issue is already present in INSERT OVERWRITE... These will be addressed with lock support.


> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870864#action_12870864 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Yes, that would be better. The check was from an earlier version of the patch.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Status: Open  (was: Patch Available)

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.1.patch

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863445#action_12863445 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Yes, that check occurs when we try to get the partition. If the user specifies a partial spec, then getPartition() will return null.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.3.patch

* Moved some constants to thrift definition

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1332:
---------------------------------

        Fix Version/s: 0.6.0
    Affects Version/s:     (was: 0.6.0)

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868513#action_12868513 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Note that because the archiving process involves several filesystem operations, there can be errors if it's run concurrently on the same partition. Once HIVE-1293 is in, this feature will be updated to address this.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863416#action_12863416 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

DDLSemanticAnalyzer.java

    622 private void analyzeAlterTableArchive(CommonTree ast, boolean isUnArchive)
		623	throws SemanticException {
		624
		625	if (!conf.getBoolVar(HiveConf.ConfVars.HIVEARCHIVEENABLED)) {
		626	throw new SemanticException("Archiving methods are currently disabled. " +
		627	"Please see the Hive wiki for more information about enabling archiving.");
		628
		629	}
		630	String tblName = unescapeIdentifier(ast.getChild(0).getText());
		631	// partition name to value
		632	List<Map<String, String>> partSpecs = getPartitionSpecs(ast);
		633	if (partSpecs.size() > 1 ) {
		634	throw new SemanticException(isUnArchive ? "UNARCHIVE" : "ARCHIVE" +
		635	" can only be run on a single partition");
		636	}
		637	if (partSpecs.size() == 0) {
		638	throw new SemanticException("ARCHIVE can only be run on partitions");


Add the error messages in ErrorMsg.java, and add negative tests for all of them.



DDLTask.java
    413	 // Means user specified a table
		414	if (simpleDesc.getPartSpec() == null) {
		415	throw new HiveException("ARCHIVE is for partitions only");
		416	}

Shouldn't this be checked in DDLSemanticAnalyzer instead ?



Same as above:

    421	 if (tbl.getTableType() != TableType.MANAGED_TABLE)  {
		422	throw new HiveException("ARCHIVE can only be performed on managed tables");
		423	}


and:

    429	 if (isArchived(p)) {
		430	throw new HiveException("Specified partition is already archived");
		431	}


One check that seems to be missing:

if we have multilple partition columns, say ds and hr.

and if the user tries to archive just by specifying ds, should that be allowed ?
I dont think it will work - are you checking that ?




> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927627#action_12927627 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

Added archiving sections at:

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Alter_Table_.28Un.29Archive
http://wiki.apache.org/hadoop/Hive/LanguageManual/Archiving

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863454#action_12863454 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

Isn't there a race condition in archive() in DDLTask - if 2 archives for the same partition are running concurrently, 
you may endup creating a sub-directory in har directory.

For eg: you may endup with: /warehosue/T/P/Tbl.har/Tbl.har/....




> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Yang updated HIVE-1332:
----------------------------

    Attachment: HIVE-1332.6.patch

This should fix the test issues - I'm re-running the test suite now but let me know if you see anything.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876749#action_12876749 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

Some tests failed - send a mail offline to Paul.
Also, it would be useful if you can change constants like  METASTOREINTARCHIVED to METASTORE_INT_ARCHIVED etc.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Paul Yang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927231#action_12927231 ] 

Paul Yang commented on HIVE-1332:
---------------------------------

No, not yet. I'll update once it is posted.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927228#action_12927228 ] 

John Sichi commented on HIVE-1332:
----------------------------------

Paul, is there wiki documentation available for this feature?


> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hive
>          Issue Type: New Feature
>          Components: Metastore
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>             Fix For: 0.6.0
>
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch, HIVE-1332.6.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876750#action_12876750 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

Also, since the tests are anyway working only in hadoop 0.20, can we also test CombineHiveInputFormat for that scenario.

> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch, HIVE-1332.4.patch, HIVE-1332.5.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-1332) Archiving partitions

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870865#action_12870865 ] 

Namit Jain commented on HIVE-1332:
----------------------------------

--nitpick

              wh.deleteDir(MetaStoreUtils.getOriginalLocation(part), true);

line 1113: HiveMetastore.java

archiveParentDir already contains the variable.

              wh.deleteDir(archiveParentDir, true);

should be fine



> Archiving partitions
> --------------------
>
>                 Key: HIVE-1332
>                 URL: https://issues.apache.org/jira/browse/HIVE-1332
>             Project: Hadoop Hive
>          Issue Type: New Feature
>          Components: Metastore
>    Affects Versions: 0.6.0
>            Reporter: Paul Yang
>            Assignee: Paul Yang
>         Attachments: HIVE-1332.1.patch, HIVE-1332.2.patch, HIVE-1332.3.patch
>
>
> Partitions and tables in Hive typically consist of many files on HDFS. An issue is that as the number of files increase, there will be higher memory/load requirements on the namenode. Partitions in bucketed tables are a particular problem because they consist of many files, one for each of the buckets.
> One way to drastically reduce the number of files is to use hadoop archives:
> http://hadoop.apache.org/common/docs/current/hadoop_archives.html
> This feature would introduce an ALTER TABLE <table_name> ARCHIVE PARTITION <spec> that would automatically put the files for the partition into a HAR file. We would also have an UNARCHIVE option to convert the files in the partition back to the original files. Archived partitions would be slower to access, but they would have the same functionality and decrease the number of files drastically. Typically, only seldom accessed partitions would be archived.
> Hadoop archives are still somewhat new, so we'll only put in support for the latest released major version (0.20). Here are some bug fixes:
> https://issues.apache.org/jira/browse/HADOOP-6591 (Important - could potentially cause data loss without this fix)
> https://issues.apache.org/jira/browse/HADOOP-6645
> https://issues.apache.org/jira/browse/MAPREDUCE-1585

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.