Posted to common-dev@hadoop.apache.org by "Mahadev konar (JIRA)" <ji...@apache.org> on 2008/04/24 22:31:24 UTC

[jira] Created: (HADOOP-3307) Archives in Hadoop.

Archives in Hadoop.
-------------------

                 Key: HADOOP-3307
                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
             Project: Hadoop Core
          Issue Type: New Feature
          Components: fs
            Reporter: Mahadev konar
            Assignee: Mahadev konar
             Fix For: 0.18.0


This is a new feature for archiving and unarchiving files in HDFS. 



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Open  (was: Patch Available)



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Open  (was: Patch Available)

Deleted the patch.



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593084#action_12593084 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

> Perhaps. But we don't want to go there in this issue, right?
True

I think I'll go with the original proposal of implicit hars. Using the name of a file within the archive directory is more confusing :).



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592485#action_12592485 ] 

Andrzej Bialecki  commented on HADOOP-3307:
-------------------------------------------

bq. On unix, mounts are not managed by a filesystem implementation, but by the kernel.

(Off-topic) Strictly speaking, yes - but in reality many filesystems are implemented as loadable modules (handlers), so it's not a monolithic kernel that handles all file ops. Many implementations exist only in user space (FUSE). The kernel then only makes sure that file ops under a given mount point are delegated to the appropriate handler, which is the model I suggested here: in our case the client FileSystem abstraction would play this role.

bq. If we were to add a mount mechanism, we should add it at the generic FileSystem level, not within HDFS.

Correct, that's what I was suggesting.



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment:     (was: hadoop-3307_3.patch)



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592465#action_12592465 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

No, it isn't that ugly. The real problem is not the '!' in the path URI that I suggested but having an opaque URI.

har://hdfs-host:port/dir/my.har/file/in/har is still opaque.

So Paths do not work with this either. Are you suggesting that we change Path to work with opaque URIs?





[jira] Issue Comment Edited: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592471#action_12592471 ] 

mahadev edited comment on HADOOP-3307 at 4/25/08 12:05 PM:
-----------------------------------------------------------------

 Isn't it the same as the mounts I suggested? Wouldn't this also require changes to the DFS Namenode? I am thinking of implementing archives without changes on the Namenode side (for the first version at least).


      was (Author: mahadev):
     isnt it the same as mounts as I suggested? Wouldnt this also require changes to DFS Namenode?  I am thinking of implementing Archives without changes on namenode side. 

  


[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Devaraj Das updated HADOOP-3307:
--------------------------------

      Resolution: Fixed
    Release Note: Adds support for archives in hadoop. A mapreduce job can be run to create an archive with indexes. A FileSystem abstraction is provided over the archive.
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

I just committed this. Thanks, Mahadev!



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592511#action_12592511 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

Nicholas, archives are directories, not files, right? The har filesystem implementation should take the path up to the first ".har" element and assume that it names an archive directory. Attempting to open har://hdfs-host:port/dir/my.har/foo.bar should throw an exception if my.har is not an archive-formatted directory. This should naturally permit nested har files, if that's desired.



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_2.patch

This is an updated patch with better comments and a few bug fixes that I found while testing corner cases.





[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592678#action_12592678 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

> I am also curious about the 'parallel creation' aspect (since that seems to be the main argument for using a new archive format). How do we populate a single HDFS file (backing the archive) in parallel?

The archive isn't a single file backed by an index but multiple files.

Quoting from the design posted earlier in the comments:

The format of an archive as a filesystem path is:

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-*

The indexes store the filenames and the offsets within the part files.

Each map would create part-$i files, and a single reduce (or multiple reduces) could create the index files in the archive directory. Does that help in understanding the design?
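
To make the parallel-creation idea concrete, here is a minimal, self-contained Java sketch (not the actual patch): plain threads stand in for map tasks, each writing its own part-i file independently, and a final step plays the role of the reduce that writes a single _index file. The one-line-per-file index format (name, part file, offset, length) and the class name are only assumptions for illustration.

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch only: each "map" writes its own part-i file independently, recording
// (file name, part file, offset, length) entries; a final step ("reduce") writes
// a single _index file. The index line format here is hypothetical.
public class ParallelArchiveSketch {
  public static void main(String[] args) throws Exception {
    Path archive = Files.createTempDirectory("foo.har");

    // Pretend input: small files already split into per-map groups.
    List<Map<String, byte[]>> splits = Arrays.asList(
        Map.of("a.txt", "aaaa".getBytes(), "b.txt", "bb".getBytes()),
        Map.of("c.txt", "cccccc".getBytes()));

    ExecutorService pool = Executors.newFixedThreadPool(splits.size());
    List<Future<List<String>>> results = new ArrayList<>();
    for (int i = 0; i < splits.size(); i++) {
      final int part = i;
      final Map<String, byte[]> split = splits.get(i);
      results.add(pool.submit(() -> {
        List<String> entries = new ArrayList<>();
        Path partFile = archive.resolve("part-" + part);
        long offset = 0;
        try (OutputStream out = Files.newOutputStream(partFile)) {
          for (Map.Entry<String, byte[]> e : split.entrySet()) {
            out.write(e.getValue());
            entries.add(e.getKey() + " part-" + part + " " + offset + " " + e.getValue().length);
            offset += e.getValue().length;
          }
        }
        return entries;
      }));
    }

    // "Reduce": collect the per-part entries into a single index file.
    List<String> index = new ArrayList<>();
    for (Future<List<String>> f : results) index.addAll(f.get());
    pool.shutdown();
    Files.write(archive.resolve("_index"), index);

    System.out.println("archive at " + archive + ": " + index);
  }
}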




[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592912#action_12592912 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

What about using this URI for the har filesystem:

har://hdfs-host:port/dir/foo.har?pathinsideharfilesystem

So the query part of the URI is actually the path inside the har filesystem.

This might require some changes to Path, but it looks like a cleaner way than assuming the .har extension marks the har archive.

An example would be:

har://hdfs-host:port/dir/foo.har?dir1/file1



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592458#action_12592458 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

You are right, Doug. The opaque URI would not work with how Path works right now, and I think it would be difficult to get it working as well. The suggestion with URI escaping looks really ugly and difficult for users to understand. Not sure how to get around this problem.



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

Here is the design for the archives. 

Archiving files in HDFS

-- Motivation --

The Namenode is a limited resource, and we usually end up with lots of small files that users do not use very often. We would like to create an archiving utility that can archive these files into archives that are semi-transparent and usable by map reduce.

-- Why not just concatenate the files?
 Concatenation of files might be useful, but it is not a full-fledged solution for archiving. Users want to keep their files as distinct files and would sometimes like to unarchive without losing the file layouts.

-- Requirements --
 -- Transparent or semi-transparent usage of archives.
 -- Must be able to archive and unarchive in parallel.
 -- Changeable archives are not a requirement, but the design should not prevent them from being implemented later.
 -- Compression is not a goal.

-- Archive Format --
-- Conventional archive formats like tar are not convenient for parallel archive creation.
-- Here is a proposal that will allow archive creation in parallel.

The format of an archive as a filesystem path is:

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-*

The indexes store the filenames and the offsets within the part files.
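
For illustration only, a minimal Java sketch over local files of how a reader could resolve a file through such an index. The one-line-per-file index format (name, part file, offset, length), the "_index" file name, and the class name are simplifying assumptions, not the actual on-disk layout.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: resolves "name" through a hypothetical _index whose lines look like
// "<name> <partFile> <offset> <length>", then reads the bytes out of the part file.
public class HarLookupSketch {
  static byte[] read(Path archiveDir, String name) throws IOException {
    for (String line : Files.readAllLines(archiveDir.resolve("_index"))) {
      String[] f = line.split(" ");
      if (f[0].equals(name)) {
        try (RandomAccessFile part = new RandomAccessFile(archiveDir.resolve(f[1]).toFile(), "r")) {
          byte[] buf = new byte[Integer.parseInt(f[3])];
          part.seek(Long.parseLong(f[2]));   // offset of the file within the part file
          part.readFully(buf);
          return buf;
        }
      }
    }
    throw new IOException(name + " not found in archive " + archiveDir);
  }

  public static void main(String[] args) throws IOException {
    System.out.write(read(Paths.get(args[0]), args[1]));
    System.out.flush();
  }
}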

-- URI Syntax --
The Har FileSystem is a client-side filesystem which is semi-transparent.
-- har:<archivePath>!<fileInArchive> (similar to the jar URI)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive

-- How will map reduce work with this new FileSystem?
   No changes to map reduce are required to use archives as input to map reduce jobs.

-- How will the dfs commands work --

   The DFS commands will have to specify the whole URI for doing dfs operations on the files. Archives are immutable, so renames, deletes, and creates will throw an exception in the initial versions of archives.

-- How will permissions work with archives?
   In the first version of HAR, all the files that are archived into HAR will lose the permissions that they initially had. In later versions of HAR, permissions can be stored in the metadata, making it possible to unarchive without losing permissions.

-- Future Work --

-- Transparent use of archives.
   This will need changes in the Hadoop FileSystem to have mounts that point to archives, and changes to DFSClient to transparently walk such a mount to the real archive, allowing transparent use of archives.
 
Comments?







[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592463#action_12592463 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> How about assuming file names ending with .har to be considered as special file format.

I'd prefer not doing a naming hack like that directly in HDFS or in FileSystem.  But I don't mind doing it in a layered filesystem, perhaps something like:

har://hdfs-host:port/dir/my.har/file/in/har

So the "har" FileSystem could pull the nested scheme off the front of the host, and scan the path for a ".har", parse the index there (caching it, presumably), and finally access the file.  In the above case, the har path would be hdfs://host:port/dir/my.har.  No changes to FileSystem or HDFS are required.  That's not too ugly, is it?



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601893#action_12601893 ] 

Devaraj Das commented on HADOOP-3307:
-------------------------------------

In the testcase, it'd be nice if you read the whole file content instead of just 4 bytes, and then validate. That'd tell you that there are no extra (spurious) bytes in the archived files, right?




[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599884#action_12599884 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

This sounds great!  Using the default filesystem makes the URIs much more readable!

> bin/hadoop archives -archiveName foo.har inputpaths outputdir

- can we name the command 'archive' instead of 'archives'?
- can the output name and directory be combined?

If so, the command might look like:

bin/hadoop archive dir/foo.har dir1 dir2 [ ... ]




[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593064#action_12593064 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> har://hdfs-host:port/dir/foo.har?dir1/file1

The problem with this is that path operations like getParent() wouldn't work.
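
A small java.net.URI illustration of the problem (the class name, host, and port are just examples): the in-archive path lives in the query, so generic path operations only ever see the archive itself.

import java.net.URI;

// With the in-archive path carried in the query, path operations see only "/dir/foo.har".
public class QueryPathSketch {
  public static void main(String[] args) {
    URI u = URI.create("har://hdfs-host:9000/dir/foo.har?dir1/file1");
    System.out.println(u.getPath());    // "/dir/foo.har" (dir1/file1 is not part of the path)
    System.out.println(u.getQuery());   // "dir1/file1"

    // A Path-style getParent() computed from getPath() would yield "/dir",
    // the parent of the archive, rather than the parent of file1 inside it.
    String p = u.getPath();
    System.out.println(p.substring(0, p.lastIndexOf('/')));   // "/dir"
  }
}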




[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Robert Chansler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Chansler updated HADOOP-3307:
------------------------------------

    Release Note: Introduced archive feature to Hadoop. A Map/Reduce job can be run to create an archive with indexes. A FileSystem abstraction is provided over the archive.  (was: Adds support for archives in hadoop. A mapreduce job can be run to create an archive with indexes. A FileSystem abstraction is provided over the archive.)



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593082#action_12593082 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> We would need mounts to make it transparent and some storage in the filesystem to know what we are supposed to do with the mounts.

Perhaps.  But we don't want to go there in this issue, right?

> I was trying to get rid of the implicit assumption. [ ... ]

The assumption that an archive directory name ends with ".har"? Is that what troubles you?

Here's another option: use percent-encoding to name the archive dir in the authority, e.g., har://hdfs:%2F%2Fhost:port%2Fdir%2Farchive/a/b.  This is harder to read, but otherwise elegant and general.

Or yet another option: use the name of a file *within* the archive directory, like the index or parts which will also presumably be fixed, not user-variable.  Then a path might look something like har://hdfs-host:port/dir/archive/har.index/a/b, where "har.index" is hardwired.




[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592508#action_12592508 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

I don't see why nested archives cannot be supported. For now (since I have just started coding) I think we should be able to support them, but I'll keep that in mind!






[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592473#action_12592473 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

What I suggested is not opaque, but hierarchical.  The nested scheme is pasted onto the front of the authority with a dash.  This would fail for schemes that have a dash.  If that's a problem, we could use something like hdfs-ar://host:port/dir/my.har/file.  This would require adding an entry to the configuration per embedded scheme, rather than a single entry for all schemes--not a big deal.



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592557#action_12592557 ] 

Joydeep Sen Sarma commented on HADOOP-3307:
-------------------------------------------

If 'har' is truly a client-side abstraction, then the assumption that the protocol is hdfs breaks this abstraction, no? One could imagine har archives on top of the local file system, or for that matter KFS or any other future file system (say Lustre?).

Also, the 'har' protocol is redundantly indicated in the URI scheme as well as the file extension. Conceivably, one could drop it from the URI scheme (and thereby retain the ability to work with different file systems) and use the presence of the .har extension in the file path to automatically layer on an archive file system.

If done right, one should be able to support any archive format, no? Essentially, we are just using the .har extension as a trigger to switch over to some nested file system (in this case, the har file system). One would think that in the future a .zip extension could be associated with a ZIP file system provider, which would allow a nested view of the files/directories underneath (this would be quite nice, since many data sets float around as zip files; one could just copy them into HDFS and, pronto, we are all set).

I am also curious about the 'parallel creation' aspect (since that seems to be the main argument for using a new archive format). How do we populate a single HDFS file (backing the archive) in parallel?





[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601632#action_12601632 ] 

Devaraj Das commented on HADOOP-3307:
-------------------------------------

1) The query part in the creation of the URI can be removed (in fact, we should probably flag an error if the har path contains a '?', since it is not a valid Path).
2) decodeURI should be done first, and then the har archive path can be extracted.
3) getHarAuth needn't parse the URI every time, since it is constant. The auth can just be stored in a class variable.
4) open() and other filesystem calls should support taking just the fragment path to a file within the archive.
5) Why is fileStatusInIndex storing the Store object in a list while going through the master index? Isn't the list always going to be of size 1 (if the file is present in the archive)?
6) The index files are not closed in the fileStatusInIndex call. This might lead to problems in cases where the underlying filesystem is the localfs (where open actually returns a file descriptor). But I am also not sure whether we should open and close on every call to fileStatusInIndex. Can we somehow cache the handles to the index files and reuse them?
7) When we create a part file, can we record things like the replication factor, permissions, etc., emit them just like the other info (partfilename, etc.) during archive creation, and store them in the index file? That way we don't have to fake everything in listStatus.
8) In listStatus, the start and end braces are missing for the if/else block.
9) In listStatus, the check hstatus.isDir()?0:hstatus.getLength() seems redundant. hstatus.isDir() is always going to be false.
10) I don't clearly understand why makeRelative is done in the listStatus and getFileStatus calls.
11) Do you enforce the .har in the archive name when it is created?

I am not done reviewing the entire patch yet.



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_3.patch

This patch fixes all of Devaraj's comments.



[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Open  (was: Patch Available)

Trying Hudson again...



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592499#action_12592499 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

OK, coming back to the topic of discussion :) ...

I like Doug's idea of
har://hdfs-host:port/dir/my.har/file/in/har

and we assume any directory ending with .har is an archive, and the path following it is the path inside the archive.





[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592468#action_12592468 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

Sorry, I think I missed this.

Would the URI be

har://hdfs-host:port/dir/my.har/file/in/har

or har:hdfs://host:port/dir/my.har/file/in/har?

The first one is not opaque but will only work with HDFS. We are implicitly assuming that this is an HDFS archive.

That might be OK though. I am not against it.



[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592177#action_12592177 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

Note that har:<path>! is a non-hierarchical, opaque URI.  Much of the Path code assumes that URIs are hierarchical and would need to be altered to support opaque uris.

One alternative would be to always "mount" hars before access.  Mounting would just require setting a "fs.har.<name>" property to a har file.

For example, a job could add a mount with:

job.set("fs.har.myfiles", "hdfs://host:port/dir/my.har");

Then specify its input as:

job.addInputPath("har://myfiles/path/in/har");

Another alternative could be to somehow escape paths in the authority of har: uris, e.g.:

har://hdfs-c-s-shost-c999-sdir/path_in_har

Where -c and -s are escapes for colon and slash.  Then the uris could still be hierarchical.  The downside is that paths would look really ugly.  Sigh.

If we wanted to make it transparent, then we might do it by adding symbolic links to the FileSystem API, rather than hacking DFSClient.  Then one could "mount" a har file by simply linking to a har: URI.




[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592523#action_12592523 ] 

lohit vijayarenu commented on HADOOP-3307:
------------------------------------------

Can we treat a directory ending with .har as an archive only if it has an index file in it?
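
A minimal sketch of such a check against the FileSystem API, assuming the index file inside the archive directory is named _index (the actual name may differ, and the class and method names are hypothetical):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Treat a ".har" directory as an archive only if an index file is present.
public class HarCheckSketch {
  static boolean looksLikeArchive(FileSystem fs, Path dir) throws IOException {
    return dir.getName().endsWith(".har") && fs.exists(new Path(dir, "_index"));
  }
}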

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602179#action_12602179 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

and fixes the findbugs warnings

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch, hadoop-3307_3.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593067#action_12593067 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

The intent is to change Path to make it work.... 

Would it not be possible to make these changes without breaking things that used to work?

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593079#action_12593079 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

I don't think just symbolic links would solve the problem. We would need mounts to make it transparent, and some storage in the filesystem to know what we are supposed to do with the mounts.
I was going to go with the second option -- parsing for a query on every path -- which does seem like a bad idea. The syntax right now is a little ambiguous and has implicit assumptions about the har filesystem. I was trying to get rid of the implicit assumption.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602460#action_12602460 ] 

Hadoop QA commented on HADOOP-3307:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383399/hadoop-3307_4.patch
  against trunk revision 663337.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2576/console

This message is automatically generated.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch, hadoop-3307_4.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_4.patch

attaching a new patch that gets rid of the find bugs warnings.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch, hadoop-3307_4.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592621#action_12592621 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> the assumption that the protocol is hdfs

No, the nested URI's scheme is prepended to the host.  A KFS archive would be:

har://kfs-host:port/dir/foo.har/a/b

which would name the file /a/b within the archive kfs://host:port/dir/foo.har.
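For illustration, a rough sketch (not code from the patch) of splitting such a URI back into the underlying filesystem URI and the path inside the archive, assuming the "<scheme>-<host:port>" authority convention described here and a hypothetical port of 9000; a naive split like this would also mishandle hosts that contain a dash:

import java.net.URI;

public class HarUriSplit {
  public static void main(String[] args) {
    URI har = URI.create("har://kfs-host:9000/dir/foo.har/a/b");
    String auth = har.getAuthority();                        // "kfs-host:9000"
    String scheme = auth.substring(0, auth.indexOf('-'));    // "kfs"
    String hostPort = auth.substring(auth.indexOf('-') + 1); // "host:9000"
    String path = har.getPath();                             // "/dir/foo.har/a/b"
    int cut = path.indexOf(".har") + ".har".length();
    URI underlying = URI.create(scheme + "://" + hostPort + path.substring(0, cut));
    String inArchive = path.substring(cut);                  // "/a/b"
    System.out.println(underlying + " , " + inArchive);      // kfs://host:9000/dir/foo.har , /a/b
  }
}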

> if done right - one should be able to support any archive format no?

That's a different goal.  Most archive formats are not Hadoop friendly.  The goal here is to develop a Hadoop-friendly archive format.



> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "lohit vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592461#action_12592461 ] 

lohit vijayarenu commented on HADOOP-3307:
------------------------------------------

How about treating file names ending with .har as a special file format? 


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592503#action_12592503 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3307:
------------------------------------------------

> we assume any directory ending with .har is an archive and the path following it is the path in the archive.

Consider har://hdfs-host:port/dir/my.har/file/in/har.  I think the assumption should be
* if my.har is a file, then follow the path in the archive.
* if my.har is a directory, treat it as a normal directory.

*Questions*: Do we support nested archives?  What will you do for something like har://hdfs-host:port/dir/foo.har/bar.har/file?


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592514#action_12592514 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3307:
------------------------------------------------

Oops, I thought that the archives were files, like tar.  Then I have another question:
In har://hdfs-host:port/dir/foo.har/bar.har/file, what is the behavior if foo.har is indeed a directory and bar.har is an archive?

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164 ] 

mahadev edited comment on HADOOP-3307 at 4/24/08 1:36 PM:
----------------------------------------------------------------

Here is the design for the archives. 

Archiving files in HDFS

- *Motivation* 

The Namenode is a limited resource, and we usually end up with lots of small files that users do not use very often. We would like to create an archiving utility that can pack these files into archives that are semi-transparent and usable by map reduce. 

- Why not just concatenate the files?
 Concatenation of files might be useful, but it is not a full-fledged solution for archiving. Users want to keep their files as distinct files and would sometimes like to unarchive without losing the file layout.

-  *Requirements* 
 - Transparent or semi-transparent usage of archives. 
 - Must be able to archive and unarchive in parallel. 
 - Changeable archives are not a requirement, but the design should not prevent them from being implemented later.
 - Compression is not a goal.

-  *Archive Format*
- Conventional archive formats like tar are not convenient for parallel archive creation. 
- Here is a proposal that will allow archive creation in parallel:

The format of an archive as a filesystem path is: 

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-* 

The indexes store the filenames and the offsets within the part files.
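For example (purely illustrative entries -- the exact on-disk record layout is defined by the implementation, not by this sketch), an index might contain lines such as:

/user/mahadev/input/a.txt part-0 0 1048576
/user/mahadev/input/b.txt part-0 1048576 524288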

-  *URI Syntax*
The Har FileSystem is a client-side filesystem which is semi-transparent. 
- har:<archivePath>!<fileInArchive> (similar to jar uri)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive

- How will map reduce work with this new filesystem?
   No changes to map reduce are required to use archives as input to map reduce jobs.

- How will the dfs commands work -- 

   The DFS commands will have to specify the whole URI for doing dfs operations on the files. Archives are immutable, so renames, deletes, and creates will throw an exception in the initial versions of archives. 

- How will permissions work with archives 
   In the first version of HAR, all the files that are archived will lose the permissions that they initially had. In later versions of HAR, permissions can be stored in the metadata, making it possible to unarchive without losing permissions.

- *Future Work*

- Transparent use of archives. 
   This will need changes to the Hadoop FileSystem to support mounts that point to archives, and changes to DFSClient to transparently walk such a mount to the real archive, allowing transparent use of archives.
 
Comments?





> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602124#action_12602124 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

Response to Devaraj's comments -- 

> 1) The query part in the creation of the URI can be removed (in fact we probably should flag an error if the har path contains a '?' since it is not a valid Path)
agreed

> 2) decodeURI should be done first and then the har archive path can be extracted
agreed

> 3) getHarAuth needn't be parsing the uri everytime since it is constant. The auth can just be stored in a class variable.
will do

> 4) open() & other filesystem calls should support taking just the fragment path to a file within the archive
makes sense

> 5) why is fileStatusInIndex storing the Store object in a list while going through the master index? Isn't the list going to be always of size 1 (if the file is present in the archive)
No, the list can have more than one entry, since hashcodes will have collisions and a hash might end up in different buckets.

> 6) the index files are not closed in the fileStatusInIndex call. This might lead to problems in the cases where the underlying filesystem is the localfs (where open actually returns a filedescriptor). But I am also not sure whether we should open and close on every call to fileStatusInIndex. Can we somehow cache the handles to the index files and reuse them.
For now I will just be opening and closing the files. I'll leave this optimization for later.

> 7) When we create a part file, can we record the things like replication factor, permissions, etc. and emit them just like we emit the other info like partfilename, etc. during archive creation and store them in the index file. That way we don't have to fake everything in the listStatus.

It was stated in the design that we will be ignoring permissions in this version. In later versions we can persist the permissions as well.  

> 8) In listStatus, the start and end braces are missing for the if/else block
will fix

> In listStatus, the check hstatus.isDir()?0:hstatus.getLength() seems redundant. hstatus.isDir is always going to be false
will fix

> 10) I don't understand clearly why makeRelative is done in the listStatus and getFileStatus calls
It's just to join two paths that are absolute. 
So a path in the archive is persisted as /user/mahadev; to use it as the trailing component of the archive path, it is made relative in order to create a new Path. 

> Do you enforce the .har in the archive name when it is created?
yes.

> 1) In writeTopLevelDirs, remove the comment "invert the paths"
will do

> 2) the SRC_LIST_LABEL file needs to have a high replication factor (maybe 10 or something)
makes sense

> Use NullOutputFormat instead of the HarOutputFormat
Did not know we had a NullOutputFormat!! :) 

> 4) Overall the coding convention is that we have starting/terminating braces even for single statement blocks. Please update the code w.r.t that.
will do

> Overall, the path manipulations (like makeRelative/absolute) confuses me. It'd be nice to cleanup the code in that aspect if possible.
will try to clean up the code.

> In the testcase, it'd be nice if you read the whole file content instead of just 4 bytes, and then validate. That'd tell you that there is no extra (spurious) bytes in the archived files, right?
makes sense.


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Devaraj Das (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601873#action_12601873 ] 

Devaraj Das commented on HADOOP-3307:
-------------------------------------

Some more comments:
1) In writeTopLevelDirs, remove the comment "invert the paths"
2) the SRC_LIST_LABEL file needs to have a high replication factor (maybe 10 or something)
3) Use NullOutputFormat instead of the HarOutputFormat
4) Overall the coding convention is that we have starting/terminating braces even for single statement blocks. Please update the code w.r.t that.

Overall, the path manipulations (like makeRelative/absolute) confuse me.  It'd be nice to clean up the code in that aspect if possible.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Attachment: hadoop-3307_1.patch

This patch addresses the archives issue. 

This patch includes the following -- 

- har:///user/mahadev/foo.har 

denotes a Hadoop archive. This is the default URI, which will use the default underlying filesystem specified in your conf. 

In case you want to be explicit, or to use some other HDFS (not the default one),

then the URI is -- 

har://hdfs-host:port/user/mahadev/foo.har

The URIs have an implicit assumption about which part of the URI denotes the directory for Hadoop archives. The code scans the path from the end and assumes the component matching *.har to be the directory that is the archive.


- it has a filesystem layer so all the commands like 

hadoop fs -ls har:///user/mahadev/foo.har 

work. Most of the mutating commands are not implemented for archives; -cat and -copyToLocal work as expected. 

- works with map reduce. 

So the input to a map reduce job could be har:///user/mahadev/foo.har, and this would work fine.
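For illustration, a minimal sketch of that usage with the old mapred API (hedged: the driver class and paths here are hypothetical, not taken from the patch):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class HarInputExample {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(HarInputExample.class);
    // The archive is addressed like any other filesystem path.
    FileInputFormat.addInputPath(job, new Path("har:///user/mahadev/foo.har"));
    // ... configure mapper, reducer and output as usual, then submit with JobClient.runJob(job).
  }
}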

Code Design and explanation - 

- There are two index files. The _index file contains entries of the form 
  filename <dir>/<file> partfile startindex size childpathnames_if_directory.
  The _index file is sorted by the hashcode of the filenames.
  The second index file, _masterindex, contains pointers into the _index file to speed up the lookup time of files inside the _index file. 
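For illustration, a rough sketch of how a lookup against these two files might work (hedged: the _masterindex record layout and the use of line numbers as pointers are assumptions made here for readability, not the patch's actual code):

import java.util.List;

public class HarIndexLookup {
  // Returns the _index fields for filename, or null if it is not in the archive.
  public static String[] lookup(String filename, List<String> masterIndex, List<String> index) {
    int hash = filename.hashCode();
    int start = 0, end = index.size();
    // Assumed _masterindex record: "startHash endHash startLine endLine"
    for (String line : masterIndex) {
      String[] f = line.split(" ");
      if (hash >= Integer.parseInt(f[0]) && hash <= Integer.parseInt(f[1])) {
        start = Integer.parseInt(f[2]);
        end = Integer.parseInt(f[3]);
        break;
      }
    }
    // Scan only that slice of _index; hash collisions mean one bucket can hold several names.
    for (String line : index.subList(start, end)) {
      String[] f = line.split(" ");
      // f: filename, <dir>|<file>, partfile, startindex, size, [child names...]
      if (f[0].equals(filename)) {
        return f;
      }
    }
    return null;
  }
}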

- To create an archive, a user needs to run 
  bin/hadoop archive -archiveName foo.har inputpaths outputdir
 
  This is a map reduce job wherein all the files are distributed amongst the maps, which create part files of around 2GB or so. The reduce then gets the start index and size from the maps for all the files and creates the _index and _masterindex. 

- Permissions are not persisted, so the permissions returned by the Har filesystem are the same as those of the index files. 



> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12600638#action_12600638 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

in reply to doug's comments:

> can we name the command 'archive' instead of 'archives'?

The name was always 'archive' :). I just mistyped it.

> can the output name and directory be combined?

I am not strongly against it -- I just feel the current command line is more user friendly, making it clear that there is an archive name which should end with .har. I might be wrong. I can go either way. I am not strongly for or against either of those.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mahadev konar updated HADOOP-3307:
----------------------------------

    Status: Patch Available  (was: Open)

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch, hadoop-3307_3.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12593070#action_12593070 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> the intent is to change path to make it work.... 

Would you special-case the handling of "har:" URIs in Path?  Or would you always parse queries as part of the hierarchical path?  Both of these sound like bad ideas to me.

We should not add special functionality to FileSystem or Path for "har:" uris.  We have a proposal that layers cleanly on top of the existing FileSystem and Path implementations.  Alternately, we might consider generic extensions to FileSystem and/or Path, like symbolic links or mount points, to see whether these might facilitate a more transparent archive implementation.  But we should not add special-purpose hacks for a particular archive format to these generic classes.

Mounts of various sorts would be fairly easy to add, but perhaps not that easy to use.  I proposed a simple version above that requires no changes to existing code.  A mount capability that permitted one to attach a FileSystem implementation at an arbitrary point in the URI space would not be overly hard to add.

The primary downside of mount-based approaches is that they require state.  One would have to add something to the configuration or job for each mount point, or require all FileSystem implementations to know how to store a mount, or add a mount file type, or somesuch.  Note that this is not a problem with Unix mount, since there's only one system involved, but in a distributed system like Hadoop we need to either transmit the mount points with code (e.g., in the job) or somehow store them in the filesystem.

The current proposal, embedding the URI of the archive within a "har:" uri, will both solve the problems at hand and require no architectural changes to the filesystem.  The only downside is that archive file naming is a little obtuse.  Long-term, the addition of symbolic links to FileSystem might address that, no?


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592551#action_12592551 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> In har://hdfs-host:port/dir/foo.har/bar.har/file, what is the behavior if foo.har is indeed a directory and bar.har is an archive?

As I said before, I think it would be nice and not too difficult to make nested archives work.  Not essential, but convenient if it's not too difficult.  So if you have hdfs://h:p/bar/* and you pack it into hdfs://h:p/foo/bar.har, and then you pack hdfs://h:p/foo/* into hdfs://h:p/dir/foo.har, then har://hdfs-h:p/dir/foo.har/bar.har/file should either (a) contain the content of the original file if we implement nested archives, or (b) throw FileNotFoundException if we don't implement nested archives.  Is that what you were asking?

> Can we treat a directory ending with .har as an archive only if it has an index file in it?

If a path component of a har: URI ends with ".har" then I think it should be an error if it is not a ".har" format directory.  It's fine to have files named .har in HDFS that are not har-format, but if one tries to access them using the archive mechanism, we shouldn't silently ignore them, but rather throw a MalformedArchive exception, no?


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Mahadev konar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592471#action_12592471 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

Isn't it the same as the mounts I suggested? Wouldn't this also require changes to the DFS Namenode?  I am thinking of implementing archives without changes on the namenode side. 


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602331#action_12602331 ] 

Hadoop QA commented on HADOOP-3307:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12383352/hadoop-3307_3.patch
  against trunk revision 663079.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 1 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2571/console

This message is automatically generated.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch, hadoop-3307_3.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592478#action_12592478 ] 

Andrzej Bialecki  commented on HADOOP-3307:
-------------------------------------------

Mahadev,

bq. Wouldnt this also require changes to DFS Namenode?

Perhaps not - the way I was thinking about it, this would be a function of a DFS client (i.e. the FileSystem on the API level). The FileSystem client would be aware of the current mounts (from Configuration), and for matching path prefixes it would translate file ops to operations on the archive file retrieved from the underlying configured FileSystem.

This way we don't have to modify the Namenode, we don't have to hack the url schemas, the URIs are transparent, and all processing load is at the cost of a user task, i.e. we don't create an additional load on the namenode. Additionally, the scope of the "mount" is the current configuration, e.g. a job, so it's not a permanent mount, it can be different for each job, and no cleanup or unmount is needed.
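For illustration, a minimal sketch of the kind of client-side translation being described (hedged: the property prefix, the helper and the paths are all hypothetical; no such mount table exists in the current code):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;

public class JobScopedMounts {
  // A per-job "mount": fs.mount.<name> -> URI of the archive backing that name.
  public static URI resolve(Configuration conf, String mountName, String pathInArchive) {
    String archive = conf.get("fs.mount." + mountName);  // e.g. hdfs://host:9000/dir/my.har
    if (archive == null) {
      return null;  // not mounted in this job's configuration
    }
    return URI.create(archive + pathInArchive);
  }
}

A job would set fs.mount.<name> in its own configuration and refer to files through the mounted name, so the mount lives and dies with the job, matching the "no cleanup or unmount" point above.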

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592469#action_12592469 ] 

Andrzej Bialecki  commented on HADOOP-3307:
-------------------------------------------

Why not do an equivalent of NFS mount on *nix? I.e. tell NameNode that any file ops under the mount point are handled by a handler, and the handler application (module in a task?) is initialized with the specs from the JobConf.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12601587#action_12601587 ] 

Hadoop QA commented on HADOOP-3307:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12382982/hadoop-3307_2.patch
  against trunk revision 661918.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    -1 findbugs.  The patch appears to introduce 9 new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2535/console

This message is automatically generated.

> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>         Attachments: hadoop-3307_1.patch, hadoop-3307_2.patch
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3307) Archives in Hadoop.

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592481#action_12592481 ] 

Doug Cutting commented on HADOOP-3307:
--------------------------------------

> Why not do an equivalent of NFS mount

On unix, mounts are not managed by a filesystem implementation, but by the kernel.  If we were to add a mount mechanism, we should add it at the generic FileSystem level, not within HDFS.

In fact, we already have a mount mechanism, but it only permits mounts by scheme, not at any point in the path.  We could add a mechanism to mount a filesystem under an arbitrary path, or even a regex like "*.har".  This could be confusing, however, since a path that looks like an hdfs: path would really be using some other protocol.  And I don't yet see that we need to add a new mount feature, since I think the existing one is sufficient to implement this feature.  Also, if we use "har:" or "hdfs-ar:" then it is clear that these are not normal HDFS files.
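For reference, that existing scheme-based mount is the fs.<scheme>.impl lookup done when a FileSystem is resolved for a URI; a har filesystem would plug in the same way (hedged: the implementation class named below is hypothetical, since it is exactly what this issue is adding):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HarSchemeBinding {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Bind the "har:" scheme to a FileSystem implementation.
    conf.set("fs.har.impl", "org.apache.hadoop.fs.HarFileSystem");
    FileSystem fs = FileSystem.get(URI.create("har://hdfs-host:9000/dir/foo.har"), conf);
    System.out.println(fs.getUri());
  }
}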

A feature that might be good to add to the generic FileSystem is symbolic links.  Then one could add a link in HDFS to an archive URI, thus grafting it into the namespace.  If one linked hdfs://host:port/dir/foo/ to hdfs-ar://host:port/dir/foo.har then one could list the files in the former to get the uris in the latter.  But that's beyond the scope of this issue.  This would be good future work to make transparent archives possible.


> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.