Posted to common-dev@hadoop.apache.org by "Arun C Murthy (JIRA)" <ji...@apache.org> on 2007/12/29 11:13:43 UTC

[jira] Created: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Implement utility-tools for working with SequenceFiles
------------------------------------------------------

                 Key: HADOOP-2501
                 URL: https://issues.apache.org/jira/browse/HADOOP-2501
             Project: Hadoop
          Issue Type: New Feature
          Components: io
            Reporter: Arun C Murthy


It would be nice to implement a bunch of utilities to work with SequenceFiles:

 * info (print out header information such as key/value types, compression type/codec, etc.)
 * cat
 * head/tail
 * merge multiple seq-files into one
 * ...

I'd imagine this would look like:
{noformat}
$ bin/hadoop seq -info /user/joe/blah.seq
$ bin/hadoop seq -head -n 10 /user/joe/blah.seq
{noformat}


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554975 ] 

Arun C Murthy commented on HADOOP-2501:
---------------------------------------

Sure... but we _do_ need more utils for SequenceFiles and I'm open to a discussion on how we are going to structure them.

[jira] Commented: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554923 ] 

Owen O'Malley commented on HADOOP-2501:
---------------------------------------

The data can be extracted via:

{code}
bin/hadoop fs -text blah.seq
{code}

Head is then trivial: pipe that output through the Unix head command.

[jira] Commented: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12559474#action_12559474 ] 

Enis Soztutar commented on HADOOP-2501:
---------------------------------------

Initially I intend to develop the following tools for sequence files:

- info : give information about the file, including its header information (key class, value class, compressed, etc.)
- dump : dump the contents of the file to a text file by calling the toString() methods of the keys and values
- head : print the first n lines of the dump
- get : get the value of the given key. Essentially we will provide a method
{code}
   Writable get(WritableComparable key) { ... }
{code}
which calls the equals() method on the key. However, to be useful as a command-line utility, we will add a command-line option
{noformat}
  bin/hadoop  seq -get <key>
{noformat}
which does the comparison on string values.

- filter <filter> : filter the input, keeping only the entries that pass the filter (fix and use https://issues.apache.org/jira/browse/HADOOP-449)
- stats : give statistics about the file, such as the number of records, average key length, average value length, longest key, shortest key, ... more?
- sort : sort the file
- merge: merge multiple files
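
The two lookup modes described for get could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: a plain in-memory list of key/value pairs stands in for a real SequenceFile reader, and the class and method names (SeqGetSketch, get, getByString) are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a linear scan over an in-memory list of records,
// standing in for reading a SequenceFile from start to end.
class SeqGetSketch {

    // API-style lookup: match the requested key with equals().
    static <K, V> V get(List<Map.Entry<K, V>> records, K key) {
        for (Map.Entry<K, V> r : records) {
            if (r.getKey().equals(key)) {
                return r.getValue();
            }
        }
        return null; // key not present
    }

    // Command-line-style lookup: the key arrives as a string, so it is
    // compared against the toString() form of each stored key.
    static <K, V> V getByString(List<Map.Entry<K, V>> records, String key) {
        for (Map.Entry<K, V> r : records) {
            if (r.getKey().toString().equals(key)) {
                return r.getValue();
            }
        }
        return null; // key not present
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> records = new ArrayList<>();
        records.add(new SimpleEntry<>(1, "one"));
        records.add(new SimpleEntry<>(2, "two"));
        System.out.println(get(records, 2));           // typed key
        System.out.println(getByString(records, "2")); // key given as text
    }
}
```

One consequence of the string comparison is that two distinct keys with the same toString() output cannot be told apart from the command line.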

In addition to these, we can discuss the following:
- tail : we can implement this, but will it be worth the effort?
- the same set of tools for map files : certainly worth doing, but I think we can leave this to another issue.
- implementations of the above tasks both with and without MapReduce : for some tasks MapReduce would simply be overkill. Should we implement both versions?

Finally, any comments and suggestions are welcome.

[jira] Commented: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "eric baldeschwieler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561233#action_12561233 ] 

eric baldeschwieler commented on HADOOP-2501:
---------------------------------------------

It would be great if we had a way of catting out the key/values from a sequence file in the same format consumed by streaming.

It would be good if such a tool could take start and end offsets, so it could be used with splits.
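
A rough sketch of what such a tool might do, assuming the tab-separated key/value line format that Hadoop streaming uses by default. Everything here is hypothetical (the SeqCatSketch class, the Rec type, and the per-record byte offsets). A real reader would also have to align the start offset to a record boundary, which SequenceFile supports through its sync markers; the sketch sidesteps that by giving each record a known offset.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: records carry a made-up byte offset so that a
// [start, end) range can select them, the way an input split would.
class SeqCatSketch {

    // One record plus the (hypothetical) offset at which it begins.
    static final class Rec {
        final long offset;
        final Object key;
        final Object value;

        Rec(long offset, Object key, Object value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    // Render every record that begins in [start, end) as "key<TAB>value\n".
    static String cat(List<Rec> records, long start, long end) {
        StringBuilder out = new StringBuilder();
        for (Rec r : records) {
            if (r.offset >= start && r.offset < end) {
                out.append(r.key).append('\t').append(r.value).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            new Rec(0, "a", 1),
            new Rec(100, "b", 2),
            new Rec(200, "c", 3));
        // Two adjacent ranges together cover each record exactly once.
        System.out.print(cat(records, 0, 150));
        System.out.print(cat(records, 150, 300));
    }
}
```

Treating the range as half-open means adjacent splits never emit the same record twice.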




[jira] Assigned: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "Enis Soztutar (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enis Soztutar reassigned HADOOP-2501:
-------------------------------------

    Assignee: Enis Soztutar

[jira] Commented: (HADOOP-2501) Implement utility-tools for working with SequenceFiles

Posted by "Runping Qi (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555372#action_12555372 ] 

Runping Qi commented on HADOOP-2501:
------------------------------------


Does bin/hadoop seq -head assume that the key/value classes implement the toString() method?

bin/hadoop seq -merge must check whether the key/value classes of the input files are the same.
It should report an error if not.
Should the merge utility offer an option to maintain the order of the keys in the results?
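
To make the two merge questions concrete, here is a minimal sketch, not a proposal for the actual tool. It assumes each input is described by the key/value class names from its header plus a list of keys (values are omitted for brevity). The names SeqMergeSketch, Input and keepOrder are hypothetical, and the order-preserving path assumes every input is already sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: an "input" is a pair of header class names plus its
// keys; values are omitted to keep the example short.
class SeqMergeSketch {

    static final class Input {
        final String keyClass;
        final String valueClass;
        final List<String> keys;

        Input(String keyClass, String valueClass, List<String> keys) {
            this.keyClass = keyClass;
            this.valueClass = valueClass;
            this.keys = keys;
        }
    }

    // Merge the inputs, rejecting any mix of key/value classes. With
    // keepOrder set, each input is assumed to be sorted already and the
    // result is sorted too; otherwise the inputs are simply concatenated.
    static List<String> merge(List<Input> inputs, boolean keepOrder) {
        List<String> out = new ArrayList<>();
        if (inputs.isEmpty()) {
            return out;
        }
        Input first = inputs.get(0);
        for (Input in : inputs) {
            if (!in.keyClass.equals(first.keyClass)
                    || !in.valueClass.equals(first.valueClass)) {
                throw new IllegalArgumentException(
                    "key/value classes differ between input files");
            }
        }
        if (!keepOrder) {
            for (Input in : inputs) {
                out.addAll(in.keys);
            }
            return out;
        }
        // k-way merge: the queue holds {input index, position} pairs,
        // ordered by the key each pair currently points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
            (a, b) -> inputs.get(a[0]).keys.get(a[1])
                .compareTo(inputs.get(b[0]).keys.get(b[1])));
        for (int i = 0; i < inputs.size(); i++) {
            if (!inputs.get(i).keys.isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        while (!heads.isEmpty()) {
            int[] h = heads.poll();
            List<String> keys = inputs.get(h[0]).keys;
            out.add(keys.get(h[1]));
            if (h[1] + 1 < keys.size()) {
                heads.add(new int[] {h[0], h[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "org.apache.hadoop.io.Text";
        List<Input> inputs = Arrays.asList(
            new Input(text, text, Arrays.asList("a", "c")),
            new Input(text, text, Arrays.asList("b", "d")));
        System.out.println(merge(inputs, true));
    }
}
```

The class check runs before any output is produced, so a mismatch is reported without leaving a partially merged result.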


